Abstract


By decomposing the image formation process into a se-
quential application of denoising autoencoders, diffusion
models (DMs) achieve state-of-the-art synthesis results on 
image data and beyond. Additionally, their formulation al- 
lows for a guiding mechanism to control the image gen- 
eration process without retraining. However, since these 
models typically operate directly in pixel space, optimiza- 
tion of powerful DMs often consumes hundreds of GPU 
days and inference is expensive due to sequential evalu- 
ations. To enable DM training on limited computational 
resources while retaining their quality and flexibility, we 
apply them in the latent space of powerful pretrained au- 
toencoders. In contrast to previous work, training diffusion 
models on such a representation allows for the first time 
to reach a near-optimal point between complexity reduc- 
tion and detail preservation, greatly boosting visual fidelity. 
By introducing cross-attention layers into the model archi- 
tecture, we turn diffusion models into powerful and flexi- 
ble generators for general conditioning inputs such as text 
or bounding boxes and high-resolution synthesis becomes 
possible in a convolutional manner. Our latent diffusion 
models (LDMs) achieve new state-of-the-art scores for im-
age inpainting and class-conditional image synthesis and
highly competitive performance on various tasks, includ- 
ing text-to-image synthesis, unconditional image generation 
and super-resolution, while significantly reducing computa- 
tional requirements compared to pixel-based DMs. 


1. Introduction 


Image synthesis is one of the computer vision fields with 
the most spectacular recent development, but also among 
those with the greatest computational demands. Espe- 
cially high-resolution synthesis of complex, natural scenes 
is presently dominated by scaling up likelihood-based mod- 
els, potentially containing billions of parameters in autore- 
gressive (AR) transformers [66,67]. In contrast, the promis- 
ing results of GANs [3, 27, 40] have been revealed to be 
mostly confined to data with comparably limited variability 
as their adversarial learning procedure does not easily scale 
to modeling complex, multi-modal distributions. Recently, 
diffusion models [82], which are built from a hierarchy of 
denoising autoencoders, have been shown to achieve impressive
results in image synthesis [30,85] and beyond [7,45,48,57],
and define the state-of-the-art in class-conditional image
synthesis [15,31] and super-resolution [72]. Moreover, even
unconditional DMs can readily be applied to tasks such
as inpainting and colorization [85] or stroke-based syn-
thesis [53], in contrast to other types of generative mod-
els [19,46,69]. Being likelihood-based models, they do not
exhibit mode-collapse and training instabilities as GANs
do and, by heavily exploiting parameter sharing, they can
model highly complex distributions of natural images with-
out involving billions of parameters as in AR models [67].

Figure 1. Boosting the upper bound on achievable quality with
less aggressive downsampling. Since diffusion models offer excel-
lent inductive biases for spatial data, we do not need the heavy spa-
tial downsampling of related generative models in latent space, but
can still greatly reduce the dimensionality of the data via suitable
autoencoding models, see Sec. 3. Images are from the DIV2K [1]
validation set, evaluated at $512^2$ px. We denote the spatial down-
sampling factor by $f$. Reconstruction FIDs [29] and PSNR are
calculated on ImageNet-val. [12]; see also Tab. 8.

Democratizing High-Resolution Image Synthesis DMs 
belong to the class of likelihood-based models, whose 
mode-covering behavior makes them prone to spend ex- 
cessive amounts of capacity (and thus compute resources) 
on modeling imperceptible details of the data [16,73]. Al- 
though the reweighted variational objective [30] aims to ad- 
dress this by undersampling the initial denoising steps, DMs 
are still computationally demanding, since training and 
evaluating such a model requires repeated function evalu- 
ations (and gradient computations) in the high-dimensional 
space of RGB images. As an example, training the most 
powerful DMs often takes hundreds of GPU days (e.g.
150-1000 V100 days in [15]) and repeated evaluations on a noisy
version of the input space also render inference expensive,
so that producing 50k samples takes approximately 5 days
[15] on a single A100 GPU. This has two consequences for
the research community and users in general: Firstly, train- 
ing such a model requires massive computational resources 
only available to a small fraction of the field, and leaves a 
huge carbon footprint [65,86]. Secondly, evaluating an al- 
ready trained model is also expensive in time and memory, 
since the same model architecture must run sequentially for 
a large number of steps (e.g. 25-1000 steps in [15]).

To increase the accessibility of this powerful model class 
and at the same time reduce its significant resource con- 
sumption, a method is needed that reduces the computa- 
tional complexity for both training and sampling. Reducing 
the computational demands of DMs without impairing their 
performance is, therefore, key to enhancing their accessibility.


Departure to Latent Space Our approach starts with 
the analysis of already trained diffusion models in pixel 
space: Fig. 2 shows the rate-distortion trade-off of a trained 
model. As with any likelihood-based model, learning can 
be roughly divided into two stages: First is a perceptual 
compression stage which removes high-frequency details 
but still learns little semantic variation. In the second stage, 
the actual generative model learns the semantic and concep- 
tual composition of the data (semantic compression). We 
thus aim to first find a perceptually equivalent, but compu- 
tationally more suitable space, in which we will train diffu- 
sion models for high-resolution image synthesis. 

Following common practice [11, 23, 66,67, 96], we sep- 
arate training into two distinct phases: First, we train 
an autoencoder which provides a lower-dimensional (and 
thereby efficient) representational space which is perceptu- 
ally equivalent to the data space. Importantly, and in con- 
trast to previous work [23,66], we do not need to rely on ex- 
cessive spatial compression, as we train DMs in the learned 
latent space, which exhibits better scaling properties with 
respect to the spatial dimensionality. The reduced complex- 
ity also provides efficient image generation from the latent 
space with a single network pass. We dub the resulting 
model class Latent Diffusion Models (LDMs). 

A notable advantage of this approach is that we need to 
train the universal autoencoding stage only once and can 
therefore reuse it for multiple DM trainings or to explore 
possibly completely different tasks [81]. This enables effi-
cient exploration of a large number of diffusion models for
various image-to-image and text-to-image tasks. For the lat- 
ter, we design an architecture that connects transformers to 
the DM’s UNet backbone [71] and enables arbitrary types 
of token-based conditioning mechanisms, see Sec. 3.3. 

Figure 2. Illustrating perceptual and semantic compression: Most
bits of a digital image correspond to imperceptible details. While
DMs allow suppressing this semantically meaningless information
by minimizing the responsible loss term, gradients (during train-
ing) and the neural network backbone (training and inference) still
need to be evaluated on all pixels, leading to superfluous compu-
tations and unnecessarily expensive optimization and inference.
We propose latent diffusion models (LDMs) as an effective gener-
ative model and a separate mild compression stage that only elim-
inates imperceptible details. Data and images from [30].

In sum, our work makes the following contributions:

(i) In contrast to purely transformer-based approaches
[23,66], our method scales more gracefully to higher dimen-
sional data and can thus (a) work on a compression level
which provides more faithful and detailed reconstructions
than previous work (see Fig. 1) and (b) be efficiently
applied to high-resolution synthesis of megapixel images.

(ii) We achieve competitive performance on multiple 
tasks (unconditional image synthesis, inpainting, stochastic 
super-resolution) and datasets while significantly lowering 
computational costs. Compared to pixel-based diffusion ap- 
proaches, we also significantly decrease inference costs. 

(iii) We show that, in contrast to previous work [93] 
which learns both an encoder/decoder architecture and a 
score-based prior simultaneously, our approach does not re- 
quire a delicate weighting of reconstruction and generative 
abilities. This ensures extremely faithful reconstructions 
and requires very little regularization of the latent space. 

(iv) We find that for densely conditioned tasks such 
as super-resolution, inpainting and semantic synthesis, our 
model can be applied in a convolutional fashion and render 
large, consistent images of $\sim 1024^2$ px.

(v) Moreover, we design a general-purpose conditioning 
mechanism based on cross-attention, enabling multi-modal 
training. We use it to train class-conditional, text-to-image 
and layout-to-image models. 

(vi) Finally, we release pretrained latent diffusion
and autoencoding models at
https://github.com/CompVis/latent-diffusion, which might be
reusable for various tasks besides training of DMs [81].


2. Related Work 

Generative Models for Image Synthesis The high di- 
mensional nature of images presents distinct challenges 
to generative modeling. Generative Adversarial Networks 
(GAN) [27] allow for efficient sampling of high resolution 
images with good perceptual quality [3,42], but are diffi-
cult to optimize [2,28,54] and struggle to capture the full
data distribution [55]. In contrast, likelihood-based meth- 
ods emphasize good density estimation which renders op- 
timization more well-behaved. Variational autoencoders 
(VAE) [46] and flow-based models [18, 19] enable efficient 
synthesis of high resolution images [9, 44, 92], but sam- 
ple quality is not on par with GANs. While autoregressive 
models (ARM) [6, 10, 94, 95] achieve strong performance 
in density estimation, computationally demanding architec- 
tures [97] and a sequential sampling process limit them to 
low resolution images. Because pixel based representations 
of images contain barely perceptible, high-frequency de- 
tails [16,73], maximum-likelihood training spends a dispro- 
portionate amount of capacity on modeling them, resulting 
in long training times. To scale to higher resolutions, several 
two-stage approaches [23,67, 101, 103] use ARMs to model 
a compressed latent image space instead of raw pixels. 


Recently, Diffusion Probabilistic Models (DM) [82], 
have achieved state-of-the-art results in density estimation 
[45] as well as in sample quality [15]. The generative power 
of these models stems from a natural fit to the inductive bi- 
ases of image-like data when their underlying neural back- 
bone is implemented as a UNet [15, 30,71, 85]. The best 
synthesis quality is usually achieved when a reweighted ob-
jective [30] is used for training. In this case, the DM corre-
sponds to a lossy compressor and allows trading image qual-
ity for compression capabilities. Evaluating and optimizing
these models in pixel space, however, has the downside of 
low inference speed and very high training costs. While 
the former can be partially addressed by advanced sampling
strategies [47, 75, 84] and hierarchical approaches [31,93],
training on high-resolution image data always requires
calculating expensive gradients. We address both drawbacks
with our proposed LDMs, which work on a compressed la- 
tent space of lower dimensionality. This renders training 
computationally cheaper and speeds up inference with al- 
most no reduction in synthesis quality (see Fig. 1). 


Two-Stage Image Synthesis To mitigate the shortcom- 
ings of individual generative approaches, a lot of research 
[11, 23, 67, 70, 101, 103] has gone into combining the 
strengths of different methods into more efficient and per- 
formant models via a two stage approach. VQ-VAEs [67, 
101] use autoregressive models to learn an expressive prior 
over a discretized latent space. [66] extend this approach to 
text-to-image generation by learning a joint distribution
over discretized image and text representations. More gen-
erally, [70] uses conditionally invertible networks to pro-
vide a generic transfer between latent spaces of diverse do-
mains. Different from VQ-VAEs, VQGANs [23, 103] em-
ploy a first stage with an adversarial and perceptual objec-
tive to scale autoregressive transformers to larger images.
However, the high compression rates required for feasible 
ARM training, which introduces billions of trainable pa- 
rameters [23, 66], limit the overall performance of such ap- 


proaches and less compression comes at the price of high 
computational cost [23,66]. Our work prevents such trade- 
offs, as our proposed LDMs scale more gently to higher 
dimensional latent spaces due to their convolutional back- 
bone. Thus, we are free to choose the level of compression 
which optimally mediates between learning a powerful first 
stage, without leaving too much perceptual compression up 
to the generative diffusion model while guaranteeing high- 
fidelity reconstructions (see Fig. 1). 

While approaches to jointly [93] or separately [80] learn 
an encoding/decoding model together with a score-based 
prior exist, the former still require a difficult weighting be- 
tween reconstruction and generative capabilities [11] and 
are outperformed by our approach (Sec. 4), and the latter 
focus on highly structured images such as human faces. 


3. Method 


To lower the computational demands of training diffu- 
sion models towards high-resolution image synthesis, we 
observe that although diffusion models allow ignoring
perceptually irrelevant details by undersampling the corre- 
sponding loss terms [30], they still require costly function 
evaluations in pixel space, which causes huge demands in 
computation time and energy resources. 

We propose to circumvent this drawback by introducing 
an explicit separation of the compressive from the genera- 
tive learning phase (see Fig. 2). To achieve this, we utilize 
an autoencoding model which learns a space that is percep- 
tually equivalent to the image space, but offers significantly 
reduced computational complexity. 

Such an approach offers several advantages: (i) By leav-
ing the high-dimensional image space, we obtain DMs
which are computationally much more efficient because 
sampling is performed on a low-dimensional space. (ii) We 
exploit the inductive bias of DMs inherited from their UNet 
architecture [71], which makes them particularly effective 
for data with spatial structure and therefore alleviates the 
need for aggressive, quality-reducing compression levels as 
required by previous approaches [23, 66]. (iii) Finally, we 
obtain general-purpose compression models whose latent 
space can be used to train multiple generative models and 
which can also be utilized for other downstream applica- 
tions such as single-image CLIP-guided synthesis [25]. 


3.1. Perceptual Image Compression 


Our perceptual compression model is based on previous 
work [23] and consists of an autoencoder trained by com- 
bination of a perceptual loss [106] and a patch-based [33] 
adversarial objective [20, 23, 103]. This ensures that the re-
constructions are confined to the image manifold by enforc-
ing local realism and avoids blurriness introduced by relying
solely on pixel-space losses such as $L_2$ or $L_1$ objectives.

More precisely, given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB
space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation
$z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the image from the
latent, giving $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$, where
$z \in \mathbb{R}^{h \times w \times c}$. Importantly, the encoder downsamples the
image by a factor $f = H/h = W/w$, and we investigate
different downsampling factors $f = 2^m$, with $m \in \mathbb{N}$.
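
As a quick illustration of the relation between image size, downsampling factor and latent size, the following sketch computes the latent shape and the resulting reduction in the number of values. The function names and the choice of latent channel count are hypothetical; only the relations f = H/h = W/w and f = 2^m come from the text.

```python
def latent_shape(height, width, m, channels):
    """Return (h, w, c) of the latent for an H x W image and f = 2**m."""
    f = 2 ** m
    if height % f or width % f:
        raise ValueError("image size must be divisible by f")
    return height // f, width // f, channels


def reduction_ratio(height, width, m, channels):
    """Ratio of pixel-space values (H*W*3) to latent values (h*w*c)."""
    h, w, c = latent_shape(height, width, m, channels)
    return (height * width * 3) / (h * w * c)


# Example: a 256 x 256 RGB image with f = 8 (m = 3) and an assumed c = 4.
print(latent_shape(256, 256, 3, 4))      # (32, 32, 4)
print(reduction_ratio(256, 256, 3, 4))   # 48.0
```

Under these assumed settings the diffusion model would operate on 48 times fewer values than in pixel space.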

In order to avoid arbitrarily high-variance latent spaces, 
we experiment with two different kinds of regularizations. 
The first variant, KL-reg., imposes a slight KL-penalty to- 
wards a standard normal on the learned latent, similar to a 
VAE [46, 69], whereas VQ-reg. uses a vector quantization 
layer [96] within the decoder. This model can be interpreted 
as a VQGAN [23] but with the quantization layer absorbed 
by the decoder. Because our subsequent DM is designed 
to work with the two-dimensional structure of our learned 
latent space $z = \mathcal{E}(x)$, we can use relatively mild compres-
sion rates and achieve very good reconstructions. This is
in contrast to previous works [23,66], which relied on an 
arbitrary 1D ordering of the learned space z to model its 
distribution autoregressively and thereby ignored much of 
the inherent structure of z. Hence, our compression model 
preserves details of x better (see Tab. 8). The full objective 
and training details can be found in the supplement. 
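
To make the KL-reg. variant more concrete, the sketch below evaluates the closed-form KL divergence between a diagonal Gaussian and a standard normal and adds it to a reconstruction loss with a small weight. It assumes, as in a standard VAE, that the encoder outputs a mean and a log-variance per latent component; the weight value `kl_weight=1e-6` and all function names are illustrative choices, not values taken from this section.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent components."""
    return sum(
        0.5 * (m * m + math.exp(lv) - 1.0 - lv)
        for m, lv in zip(mean, log_var)
    )


def regularized_loss(rec_loss, mean, log_var, kl_weight=1e-6):
    """Reconstruction loss plus a slight KL penalty (weight is an assumption)."""
    return rec_loss + kl_weight * kl_to_standard_normal(mean, log_var)


# A latent that already matches N(0, 1) incurs no penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
```

A small weight keeps the penalty from dominating the reconstruction terms, which matches the "slight" regularization described above.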


3.2. Latent Diffusion Models 


Diffusion Models [82] are probabilistic models designed to
learn a data distribution $p(x)$ by gradually denoising a normally
distributed variable, which corresponds to learning
the reverse process of a fixed Markov Chain of length $T$.
For image synthesis, the most successful models [15,30,72]
rely on a reweighted variant of the variational lower bound
on $p(x)$, which mirrors denoising score-matching [85].
These models can be interpreted as an equally weighted
sequence of denoising autoencoders $\epsilon_\theta(x_t, t)$; $t = 1 \dots T$,
which are trained to predict a denoised variant of their input
$x_t$, where $x_t$ is a noisy version of the input $x$. The corresponding
objective can be simplified to (Sec. B)

$$L_{DM} = \mathbb{E}_{x, \epsilon \sim \mathcal{N}(0,1), t}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|_2^2 \right], \tag{1}$$

with $t$ uniformly sampled from $\{1, \dots, T\}$.
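
The objective in Eq. (1) can be sketched as a single Monte Carlo sample, as below. The form of the forward process (x_t as a weighted sum of x and Gaussian noise) and the linear beta schedule are assumptions taken from the common DDPM parameterization, since this section does not spell them out; `eps_model` is a hypothetical stand-in for the network.

```python
import math
import random


def alpha_bar_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars  # alpha_bars[t - 1] corresponds to step t


def add_noise(x, eps, alpha_bar_t):
    """x_t = sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * xi + b * ei for xi, ei in zip(x, eps)]


def dm_loss_sample(x, eps_model, alpha_bars, rng):
    """One Monte Carlo sample of Eq. (1): ||eps - eps_model(x_t, t)||^2."""
    T = len(alpha_bars)
    t = rng.randint(1, T)                    # t uniform in {1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x]   # eps ~ N(0, 1)
    x_t = add_noise(x, eps, alpha_bars[t - 1])
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))


# Usage with a trivial model that always predicts zero noise.
rng = random.Random(0)
bars = alpha_bar_schedule(1000)
loss = dm_loss_sample([0.5, -0.5], lambda x_t, t: [0.0] * len(x_t), bars, rng)
```

Averaging this quantity over data, noise and timesteps gives the expectation in Eq. (1).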
Generative Modeling of Latent Representations With
our trained perceptual compression models consisting of $\mathcal{E}$
and $\mathcal{D}$, we now have access to an efficient, low-dimensional
latent space in which high-frequency, imperceptible details 
are abstracted away. Compared to the high-dimensional 
pixel space, this space is more suitable for likelihood-based 
generative models, as they can now (i) focus on the impor- 
tant, semantic bits of the data and (ii) train in a lower di- 
mensional, computationally much more efficient space. 
Unlike previous work that relied on autoregressive, 
attention-based transformer models in a highly compressed, 
discrete latent space [23,66, 103], we can take advantage of 
image-specific inductive biases that our model offers. This
includes the ability to build the underlying UNet primar-
ily from 2D convolutional layers, and further focusing the
objective on the perceptually most relevant bits using the
reweighted bound, which now reads

$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0,1), t}\left[ \| \epsilon - \epsilon_\theta(z_t, t) \|_2^2 \right]. \tag{2}$$

The neural backbone $\epsilon_\theta(\circ, t)$ of our model is realized as a
time-conditional UNet [71]. Since the forward process is
fixed, $z_t$ can be efficiently obtained from $\mathcal{E}$ during training,
and samples from $p(z)$ can be decoded to image space with
a single pass through $\mathcal{D}$.

Figure 3. We condition LDMs either via concatenation or by a
more general cross-attention mechanism. See Sec. 3.3.
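
The sampling path described above (iterating in latent space, then decoding once) can be sketched as follows. The update rule is the deterministic DDIM step (eta = 0) as commonly stated for the sampler of [84]; it is not given in this section and is included here as an assumption. `eps_model`, `decoder` and `alpha_bars` are hypothetical placeholders for the trained UNet, the decoder and the noise schedule.

```python
import math


def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from step t to the previous step."""
    sqrt_a = math.sqrt(alpha_bar_t)
    sqrt_1ma = math.sqrt(1.0 - alpha_bar_t)
    # Estimate of the clean latent implied by the noise prediction.
    z0_pred = [(z - sqrt_1ma * e) / sqrt_a for z, e in zip(z_t, eps_pred)]
    sqrt_ap = math.sqrt(alpha_bar_prev)
    sqrt_1map = math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_ap * z0 + sqrt_1map * e for z0, e in zip(z0_pred, eps_pred)]


def sample_latent(z_T, eps_model, alpha_bars):
    """Run the reverse process; alpha_bars[i] belongs to step i + 1."""
    z = z_T
    for i in range(len(alpha_bars) - 1, -1, -1):
        prev = alpha_bars[i - 1] if i > 0 else 1.0  # alpha_bar = 1: no noise left
        z = ddim_step(z, eps_model(z, i + 1), alpha_bars[i], prev)
    return z


def generate(z_T, eps_model, alpha_bars, decoder):
    """Sample in latent space, then decode with a single pass through D."""
    return decoder(sample_latent(z_T, eps_model, alpha_bars))
```

Note that the decoder is called exactly once, after all denoising steps, so its cost does not scale with the number of sampling steps.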


3.3. Conditioning Mechanisms 

Similar to other types of generative models [56, 83], 
diffusion models are in principle capable of modeling 
conditional distributions of the form p(z|y). This can 
be implemented with a conditional denoising autoencoder 
$\epsilon_\theta(z_t, t, y)$ and paves the way to controlling the synthesis
process through inputs y such as text [68], semantic maps 
[33,61] or other image-to-image translation tasks [34]. 

In the context of image synthesis, however, combining 
the generative power of DMs with other types of condition- 
ings beyond class-labels [15] or blurred variants of the input 
image [72] is so far an under-explored area of research. 

We turn DMs into more flexible conditional image gener- 
ators by augmenting their underlying UNet backbone with 
the cross-attention mechanism [97], which is effective for 
learning attention-based models of various input modali-
ties [35,36]. To pre-process $y$ from various modalities (such
as language prompts) we introduce a domain specific en-
coder $\tau_\theta$ that projects $y$ to an intermediate representation
$\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which is then mapped to the intermediate
layers of the UNet via a cross-attention layer implementing
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$, with

$$Q = W_Q^{(i)} \cdot \varphi_i(z_t), \quad K = W_K^{(i)} \cdot \tau_\theta(y), \quad V = W_V^{(i)} \cdot \tau_\theta(y).$$

Here, $\varphi_i(z_t) \in \mathbb{R}^{N \times d_\epsilon^i}$ denotes a (flattened) intermediate
representation of the UNet implementing $\epsilon_\theta$, and
$W_Q^{(i)} \in \mathbb{R}^{d \times d_\epsilon^i}$, $W_K^{(i)} \in \mathbb{R}^{d \times d_\tau}$ and $W_V^{(i)} \in \mathbb{R}^{d \times d_\tau}$ are learnable pro-
jection matrices [36,97]. See Fig. 3 for a visual depiction.

Figure 4. Samples from LDMs trained on CelebAHQ [39], FFHQ [41], LSUN-Churches [102], LSUN-Bedrooms [102] and class-
conditional ImageNet [12], each with a resolution of 256 × 256. Best viewed when zoomed in. For more samples cf. the supplement.

Based on image-conditioning pairs, we then learn the
conditional LDM via

$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0,1), t}\left[ \| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \|_2^2 \right], \tag{3}$$

where both $\tau_\theta$ and $\epsilon_\theta$ are jointly optimized via Eq. 3. This
conditioning mechanism is flexible as $\tau_\theta$ can be parameter-
ized with domain-specific experts, e.g. (unmasked) trans-
formers [97] when $y$ are text prompts (see Sec. 4.3.1).
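
The cross-attention layer used for conditioning in this section can be sketched in plain Python as below. This is a minimal single-head illustration, not the authors' implementation: matrices are lists of rows, and the projection matrices are stored as (input dimension × d) so that they are applied on the right of the row-major feature matrices. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(phi, tau, w_q, w_k, w_v):
    """phi: N x d_eps UNet features; tau: M x d_tau conditioning tokens."""
    q = matmul(phi, w_q)   # N x d, queries come from the UNet features
    k = matmul(tau, w_k)   # M x d, keys come from the conditioning
    v = matmul(tau, w_v)   # M x d, values come from the conditioning
    d = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])   # N x M, equals Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)   # N x d
```

Each of the N spatial positions of the UNet feature map receives a convex combination of the M projected conditioning tokens, which is what allows token sequences of arbitrary length to steer the synthesis.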


4. Experiments 


LDMs provide a means for flexible and computationally
tractable diffusion-based image synthesis of various image
modalities, which we empirically show in the following. 
Firstly, however, we analyze the gains of our models com- 
pared to pixel-based diffusion models in both training and 
inference. Interestingly, we find that LDMs trained in VQ- 
regularized latent spaces sometimes achieve better sample 
quality, even though the reconstruction capabilities of VQ- 
regularized first stage models slightly fall behind those of 
their continuous counterparts, cf. Tab. 8. A visual compari- 
son between the effects of first stage regularization schemes 
on LDM training and their generalization abilities to resolu-
tions > $256^2$ can be found in Appendix D.1. In E.2 we list
details on architecture, implementation, training and evalu- 
ation for all results presented in this section. 


4.1. On Perceptual Compression Tradeoffs 


This section analyzes the behavior of our LDMs with dif-
ferent downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$ (abbre-
viated as LDM-$f$, where LDM-1 corresponds to pixel-based
DMs). To obtain a comparable test-field, we fix the com- 
putational resources to a single NVIDIA A100 for all ex- 
periments in this section and train all models for the same 
number of steps and with the same number of parameters. 

Tab. 8 shows hyperparameters and reconstruction perfor- 
mance of the first stage models used for the LDMs com- 


pared in this section. Fig. 6 shows sample quality as a func- 
tion of training progress for 2M steps of class-conditional 
models on the ImageNet [12] dataset. We see that, i) small
downsampling factors for LDM-{1,2} result in slow train- 
ing progress, whereas ii) overly large values of f cause stag- 
nating fidelity after comparably few training steps. Revis- 
iting the analysis above (Fig. 1 and 2) we attribute this to 
i) leaving most of perceptual compression to the diffusion 
model and ii) too strong first stage compression resulting 
in information loss and thus limiting the achievable qual- 
ity. LDM-{4-16} strike a good balance between efficiency 
and perceptually faithful results, which manifests in a sig- 
nificant FID [29] gap of 38 between pixel-based diffusion 
(LDM-1) and LDM-8 after 2M training steps. 

In Fig. 7, we compare models trained on CelebA-
HQ [39] and ImageNet in terms of sampling speed for differ-
ent numbers of denoising steps with the DDIM sampler [84]
and plot it against FID-scores [29]. LDM-{4-8} outper- 
form models with unsuitable ratios of perceptual and con- 
ceptual compression. Especially compared to pixel-based 
LDM-1, they achieve much lower FID scores while simulta-
neously significantly increasing sample throughput. Com-
plex datasets such as ImageNet require reduced compres-
sion rates to avoid reducing quality. In summary, LDM-4
and -8 offer the best conditions for achieving high-quality 
synthesis results. 


4.2. Image Generation with Latent Diffusion 

We train unconditional models of $256^2$ images on
CelebA-HQ [39], FFHQ [41], LSUN-Churches and 
-Bedrooms [102] and evaluate i) the sample quality and ii)
their coverage of the data manifold using i) FID [29] and
ii) Precision-and-Recall [50]. Tab. 1 summarizes our re-
sults. On CelebA-HQ, we report a new state-of-the-art FID
of 5.11, outperforming previous likelihood-based models as 
well as GANs. We also outperform LSGM [93] where a la- 
tent diffusion model is trained jointly together with the first 
stage. In contrast, we train diffusion models in a fixed space
and avoid the difficulty of weighing reconstruction quality
against learning the prior over the latent space, see Fig. 1-2.

Figure 5. Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), which was trained on the
LAION [78] database. Samples generated with 200 DDIM steps and $\eta = 1.0$. We use unconditional guidance [32] with $s = 10.0$.

Figure 6. Analyzing the training of class-conditional LDMs with
different downsampling factors $f$ over 2M train steps on the Im-
ageNet dataset. Pixel-based LDM-1 requires substantially larger
train times compared to models with larger downsampling factors
(LDM-{4-16}). Too much perceptual compression as in LDM-32
limits the overall sample quality. All models are trained on a sin-
gle NVIDIA A100 with the same computational budget. Results
obtained with 100 DDIM steps [84] and $\kappa = 0$.

Figure 7. Comparing LDMs with varying compression on the
CelebA-HQ (left) and ImageNet (right) datasets. Different mark-
ers indicate {10, 20, 50, 100, 200} sampling steps using DDIM,
from right to left along each line. The dashed line shows the FID
scores for 200 steps, indicating the strong performance of LDM-
{4-8}. FID scores assessed on 5000 samples. All models were
trained for 500k (CelebA) / 2M (ImageNet) steps on an A100.

We outperform prior diffusion based approaches on all 
but the LSUN-Bedrooms dataset, where our score is close 
to ADM [15], despite utilizing half its parameters and re-
quiring four times fewer training resources (see Appendix E.3.5).


CelebA-HQ 256 × 256
Method                    FID↓         Prec.↑   Recall↑
DC-VAE [63]               15.8         -        -
VQGAN+T. [23] (k=400)     10.2         -        -
PGGAN [39]                8.0          -        -
LSGM [93]                 7.22         -        -
UDM [43]                  7.16         -        -
LDM-4 (ours, 500-s†)      5.11         0.72     0.49

FFHQ 256 × 256
Method                    FID↓         Prec.↑   Recall↑
ImageBART [21]            9.57         -        -
U-Net GAN (+aug) [77]     10.9 (7.6)   -        -
UDM [43]                  5.54         -        -
StyleGAN [41]             4.16         0.71     0.46
ProjectedGAN [76]         3.08         0.65     0.46
LDM-4 (ours, 200-s)       4.98         0.73     0.50

LSUN-Churches 256 × 256
Method                    FID↓         Prec.↑   Recall↑
DDPM [30]                 7.89         -        -
ImageBART [21]            7.32         -        -
PGGAN [39]                6.42         -        -
StyleGAN [41]             4.21         -        -
StyleGAN2 [42]            3.86         -        -
ProjectedGAN [76]         1.59         0.61     0.44
LDM-8* (ours, 200-s)      4.02         0.64     0.52

LSUN-Bedrooms 256 × 256
Method                    FID↓         Prec.↑   Recall↑
ImageBART [21]            5.51         -        -
DDPM [30]                 4.9          -        -
UDM [43]                  4.57         -        -
StyleGAN [41]             2.35         0.59     0.48
ADM [15]                  1.90         0.66     0.51
ProjectedGAN [76]         1.52         0.61     0.34
LDM-4 (ours, 200-s)       2.95         0.66     0.48

Table 1. Evaluation metrics for unconditional image synthesis.
CelebA-HQ results reproduced from [43, 63, 100], FFHQ from
[42, 43]. †: N-s refers to N sampling steps with the DDIM [84]
sampler. *: trained in KL-regularized latent space. Additional re-
sults can be found in the supplementary.


Text-Conditional Image Synthesis
Method                 FID↓    IS↑           Nparams   Sampling details
CogView† [17]          27.10   18.20         4B        self-ranking, rejection rate 0.017
LAFITE† [109]          26.94   26.02         75M       -
GLIDE* [59]            12.24   -             6B        277 DDIM steps, c.f.g. [32] s = 3
Make-A-Scene* [26]     11.84   -             4B        c.f.g. for AR models [98] s = 5
LDM-KL-8               23.31   20.03±0.23    1.45B     250 DDIM steps
LDM-KL-8-G*            12.63   30.29±0.42    1.45B     250 DDIM steps, c.f.g. [32] s = 1.5

Table 2. Evaluation of text-conditional image synthesis on the
256 × 256-sized MS-COCO [51] dataset: with 250 DDIM [84]
steps our model is on par with the most recent diffusion [59] and
autoregressive [26] methods despite using significantly fewer pa-
rameters. †/*: Numbers from [109]/[26].


Moreover, LDMs consistently improve upon GAN-based 
methods in Precision and Recall, thus confirming the ad- 
vantages of their mode-covering likelihood-based training 
objective over adversarial approaches. In Fig. 4 we also 
show qualitative results on each dataset. 


Figure 8. Layout-to-image synthesis with an LDM on COCO [4], 
see Sec. 4.3.1. Quantitative evaluation in the supplement D.3. 


4.3. Conditional Latent Diffusion 


4.3.1 Transformer Encoders for LDMs 


By introducing cross-attention based conditioning into 
LDMs we open them up for various conditioning modali-
ties previously unexplored for diffusion models. For text-
to-image modeling, we train a 1.45B parameter
KL-regularized LDM conditioned on language prompts on 
LAION-400M [78]. We employ the BERT-tokenizer [14] 
and implement $\tau_\theta$ as a transformer [97] to infer a latent
code which is mapped into the UNet via (multi-head) cross- 
attention (Sec. 3.3). This combination of domain specific 
experts for learning a language representation and visual 
synthesis results in a powerful model, which generalizes 
well to complex, user-defined text prompts, cf. Fig. 8 and 5. 
For quantitative analysis, we follow prior work and evaluate 
text-to-image generation on the MS-COCO [51] validation 
set, where our model improves upon powerful AR [17, 66] 
and GAN-based [109] methods, cf. Tab. 2. We note that ap- 
plying classifier-free diffusion guidance [32] greatly boosts 
sample quality, such that the guided LDM-KL-8-G is on par 
with the recent state-of-the-art AR [26] and diffusion mod- 
els [59] for text-to-image synthesis, while substantially re- 
ducing parameter count. To further analyze the flexibility of 
the cross-attention based conditioning mechanism we also 
train models to synthesize images based on semantic lay- 
outs on OpenImages [49], and finetune on COCO [4], see 
Fig. 8. See Sec. D.3 for the quantitative evaluation and im- 
plementation details. 
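
The guided variants above combine a conditional and an unconditional noise prediction with a scale s. The sketch below uses one common parameterization of classifier-free guidance, in which s = 1 recovers the purely conditional prediction; this section does not state the exact formula, so the form used here is an assumption, and `eps_model` and `null_cond` are hypothetical placeholders.

```python
def guided_eps(eps_uncond, eps_cond, s):
    """eps_uncond + s * (eps_cond - eps_uncond); s = 1 is purely conditional."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guided_prediction(eps_model, z_t, t, cond, null_cond, s):
    """Evaluate the model with and without conditioning, then extrapolate."""
    eps_u = eps_model(z_t, t, null_cond)   # prediction with "empty" conditioning
    eps_c = eps_model(z_t, t, cond)        # prediction with the actual conditioning
    return guided_eps(eps_u, eps_c, s)
```

With s > 1 the prediction is pushed further along the direction in which the conditioning changes the output. Each sampling step then needs two network evaluations instead of one.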

Lastly, following prior work [3, 15, 21, 23], we evalu- 
ate our best-performing class-conditional ImageNet mod- 
els with f € {4,8} from Sec. 4.1 in Tab. 3, Fig. 4 and 
Sec. D.4. Here we outperform the state of the art diffu-
sion model ADM [15] while significantly reducing compu-
tational requirements and parameter count, cf. Tab. 18.


Method              FID↓     IS↑            Precision↑   Recall↑   Nparams   Sampling details
BigGan-deep [3]     6.95     203.6±2.6      0.87         0.28      340M      -
ADM [15]            10.94    100.98         0.69         0.63      554M      250 DDIM steps
ADM-G [15]          4.59     186.7          0.82         0.52      608M      250 DDIM steps
LDM-4 (ours)        10.56    103.49±1.24    0.71         0.62      400M      250 DDIM steps
LDM-4-G (ours)      3.60     247.67±5.59    0.87         0.48      400M      250 steps, c.f.g. [32], s = 1.5

Table 3. Comparison of a class-conditional ImageNet LDM with
recent state-of-the-art methods for class-conditional image gener-
ation on ImageNet [12]. A more detailed comparison with addi-
tional baselines can be found in D.4, Tab. 10 and F. c.f.g. denotes
classifier-free guidance with a scale s as proposed in [32].


4.3.2 Convolutional Sampling Beyond $256^2$

By concatenating spatially aligned conditioning informa-
tion to the input of $\epsilon_\theta$, LDMs can serve as efficient general-
purpose image-to-image translation models. We use this
to train models for semantic synthesis, super-resolution 
(Sec. 4.4) and inpainting (Sec. 4.5). For semantic synthe- 
sis, we use images of landscapes paired with semantic maps 
[23, 61] and concatenate downsampled versions of the se- 
mantic maps with the latent image representation of a f = 4 
model (VQ-reg., see Tab. 8). We train on an input resolution 
of $256^2$ (crops from $384^2$) but find that our model generalizes
to larger resolutions and can generate images up to the
megapixel regime when evaluated in a convolutional manner
(see Fig. 9). We exploit this behavior to also apply the
super-resolution models in Sec. 4.4 and the inpainting models
in Sec. 4.5 to generate large images between $512^2$ and
$1024^2$. For this application, the signal-to-noise ratio (induced
by the scale of the latent space) significantly affects
the results. In Sec. D.1 we illustrate this when learning an
LDM on (i) the latent space as provided by an f = 4 model
(KL-reg., see Tab. 8), and (ii) a rescaled version, scaled by
the component-wise standard deviation.

The latter, in combination with classifier-free guidance [32],
also enables the direct synthesis of $> 256^2$ images
for the text-conditional LDM-KL-8-G as in Fig. 13.
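
The concatenation-based conditioning described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the nearest-neighbour downsampling by striding, the channel-first layout and the hypothetical 10-class one-hot semantic map are all assumptions.

```python
import numpy as np

def downsample_nearest(x, factor):
    """Nearest-neighbour downsampling of a (C, H, W) array by an integer factor."""
    return x[:, ::factor, ::factor]

def concat_conditioning(z_t, cond_map, f):
    """Concatenate a spatially aligned conditioning map to the noisy latent.

    z_t:      (C_z, H/f, W/f) noisy latent
    cond_map: (C_y, H, W) conditioning in pixel space, e.g. a one-hot semantic map
    """
    cond_small = downsample_nearest(cond_map, f)
    assert cond_small.shape[1:] == z_t.shape[1:], "spatial sizes must match"
    return np.concatenate([z_t, cond_small], axis=0)

f = 4
z_t = np.zeros((3, 64, 64))        # latent of a 256x256 image for f = 4
sem = np.zeros((10, 256, 256))     # hypothetical 10-class one-hot semantic map
unet_input = concat_conditioning(z_t, sem, f)

# The same function applies unchanged to a larger spatial extent.
z_big = np.zeros((3, 128, 256))    # latent of a 512x1024 image
sem_big = np.zeros((10, 512, 1024))
unet_input_big = concat_conditioning(z_big, sem_big, f)
```

Because nothing in this input construction depends on a fixed spatial size, a fully convolutional denoiser can be evaluated on the larger input without any change, which is the property exploited for sampling beyond the training resolution.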


Figure 9. An LDM trained on $256^2$ resolution can generalize to
larger resolution (here: 512 x 1024) for spatially conditioned tasks 
such as semantic synthesis of landscape images. See Sec. 4.3.2. 


4.4. Super-Resolution with Latent Diffusion 

LDMs can be efficiently trained for super-resolution by 
directly conditioning on low-resolution images via concatenation
(cf. Sec. 3.3). In a first experiment, we follow SR3


Figure 10. ImageNet 64→256 super-resolution on ImageNet-Val.
LDM-SR has advantages at rendering realistic textures but SR3 
can synthesize more coherent fine structures. See appendix for 
additional samples and cropouts. SR3 results from [72]. 


[72] and fix the image degradation to a bicubic interpolation
with 4×-downsampling and train on ImageNet following
SR3’s data processing pipeline. We use the f = 4 autoencoding
model pretrained on OpenImages (VQ-reg., cf.
Tab. 8) and concatenate the low-resolution conditioning y
and the inputs to the UNet, i.e. $\tau_\theta$ is the identity. Our qualitative
and quantitative results (see Fig. 10 and Tab. 5) show
competitive performance and LDM-SR outperforms SR3 
in FID while SR3 has a better IS. A simple image regres- 
sion model achieves the highest PSNR and SSIM scores; 
however these metrics do not align well with human per- 
ception [106] and favor blurriness over imperfectly aligned 
high frequency details [72]. Further, we conduct a user 
study comparing the pixel-baseline with LDM-SR. We fol- 
low SR3 [72] where human subjects were shown a low-res 
image in between two high-res images and asked for pref- 
erence. The results in Tab. 4 affirm the good performance 
of LDM-SR. PSNR and SSIM can be pushed by using a 
post-hoc guiding mechanism [15] and we implement this 
image-based guider via a perceptual loss, see Sec. D.6. 
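
The remark that PSNR favors blurriness over imperfectly aligned high-frequency detail can be illustrated with a toy example. The sketch below uses the standard PSNR definition; the striped test image and the two candidate reconstructions are illustrative assumptions and are unrelated to the experiments reported in Tab. 5.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

# Reference: vertical stripes of width one pixel (pure high-frequency content).
ref = np.zeros((8, 8))
ref[:, ::2] = 1.0

blurry = np.full_like(ref, 0.5)       # all detail averaged away
shifted = np.roll(ref, 1, axis=1)     # correct texture, misaligned by one pixel

psnr_blurry = psnr(ref, blurry)
psnr_shifted = psnr(ref, shifted)
```

Here the featureless gray image obtains the higher PSNR although the shifted image contains exactly the right texture, which is why such metrics are complemented by FID, IS and a user study.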


                              SR on ImageNet          Inpainting on Places
User Study                    Pixel-DM (f1)  LDM-4    LAMA [88]  LDM-4
Task 1: Preference vs GT ↑    16.0%          30.4%    13.6%      21.0%
Task 2: Preference Score ↑    29.4%          70.6%    31.9%      68.1%


Table 4. Task 1: Subjects were shown ground truth and generated
image and asked for preference. Task 2: Subjects had to decide
between two generated images. More details in E.3.6.


Since the bicubic degradation process does not generalize 
well to images which do not follow this pre-processing, we 
also train a generic model, LDM-BSR, by using more di- 
verse degradation. The results are shown in Sec. D.6.1. 


Method                            FID↓        IS↑    PSNR↑     SSIM↑      N_params  samples/s (*)
Image Regression [72]             15.2        121.1  27.9      0.801      625M      N/A
SR3 [72]                          5.2         180.1  26.4      0.762      625M      N/A
LDM-4 (ours, 100 steps)           2.8†/4.8‡   166.3  24.4±3.8  0.69±0.14  169M      4.62
LDM-4 (ours, big, 100 steps)      2.4†/4.3‡   174.9  24.7±4.1  0.71±0.15  552M      4.5
LDM-4 (ours, 50 steps, guiding)   4.4†/6.4‡   153.7  25.8±3.7  0.74±0.12  184M      0.38


Table 5. ×4 upscaling results on ImageNet-Val. ($256^2$); †: FID
features computed on validation split, ‡: FID features computed
on train split; *: Assessed on an NVIDIA A100


                          train throughput  sampling throughput†  train+val    FID@2k
Model (reg.-type)         samples/sec.      @256      @512        hours/epoch  epoch 6
LDM-1 (no first stage)    0.11              0.26      0.07        20.66        24.74
LDM-4 (KL, w/ attn)       0.32              0.97      0.34        7.66         15.21
LDM-4 (VQ, w/ attn)       0.33              0.97      0.34        7.04         14.99
LDM-4 (VQ, w/o attn)      0.35              0.99      0.36        6.66         15.95


Table 6. Assessing inpainting efficiency. †: Deviations from Fig. 7
due to varying GPU settings/batch sizes cf. the supplement.


4.5. Inpainting with Latent Diffusion 


Inpainting is the task of filling masked regions of an image
with new content either because parts of the image are
corrupted or to replace existing but undesired content
within the image. We evaluate how our general approach 
for conditional image generation compares to more special- 
ized, state-of-the-art approaches for this task. Our evalua- 
tion follows the protocol of LaMa [88], a recent inpainting 
model that introduces a specialized architecture relying on 
Fast Fourier Convolutions [8]. The exact training & evalua- 
tion protocol on Places [108] is described in Sec. E.2.2. 

We first analyze the effect of different design choices for
the first stage. In particular, we compare the inpainting efficiency
of LDM-1 (i.e. a pixel-based conditional DM) with
LDM-4, for both KL and VQ regularizations, as well as VQ-LDM-4
without any attention in the first stage (see Tab. 8),
where the latter reduces GPU memory for decoding at high
resolutions. For comparability, we fix the number of parameters
for all models. Tab. 6 reports the training and sampling
throughput at resolution $256^2$ and $512^2$, the total training
time in hours per epoch and the FID score on the validation
split after six epochs. Overall, we observe a speed-up of at
least 2.7× between pixel- and latent-based diffusion models
while improving FID scores by a factor of at least 1.6×.
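
The two factors quoted above can be recomputed from Tab. 6. The following sketch assumes that the speed-up refers to the training time in hours per epoch and that the FID factor is the ratio of the LDM-1 score to each latent model's score; both readings are assumptions, and the numbers are copied from the table.

```python
# Values copied from Tab. 6.
hours_per_epoch = {
    "LDM-1 (no first stage)": 20.66,
    "LDM-4 (KL, w/ attn)": 7.66,
    "LDM-4 (VQ, w/ attn)": 7.04,
    "LDM-4 (VQ, w/o attn)": 6.66,
}
fid_epoch6 = {
    "LDM-1 (no first stage)": 24.74,
    "LDM-4 (KL, w/ attn)": 15.21,
    "LDM-4 (VQ, w/ attn)": 14.99,
    "LDM-4 (VQ, w/o attn)": 15.95,
}

baseline = "LDM-1 (no first stage)"
latent_models = [name for name in hours_per_epoch if name != baseline]

# Ratio of the pixel-space baseline to each latent-space model.
speedups = {m: hours_per_epoch[baseline] / hours_per_epoch[m] for m in latent_models}
fid_gains = {m: fid_epoch6[baseline] / fid_epoch6[m] for m in latent_models}

min_speedup = min(speedups.values())    # smallest speed-up over all latent models
min_fid_gain = min(fid_gains.values())  # smallest FID improvement factor
```

Under these assumptions, the smallest ratios round to the factors 2.7 and 1.6 stated in the text.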

The comparison with other inpainting approaches in 
Tab. 7 shows that our model with attention improves the 
overall image quality as measured by FID over that of [88]. 
LPIPS between the unmasked images and our samples is 
slightly higher than that of [88]. We attribute this to [88] 
only producing a single result which tends to recover more 
of an average image compared to the diverse results pro- 
duced by our LDM cf. Fig. 21. Additionally in a user study 
(Tab. 4) human subjects favor our results over those of [88]. 

Based on these initial results, we also trained a larger dif- 
fusion model (big in Tab. 7) in the latent space of the VQ- 
regularized first stage without attention. Following [15], 
the UNet of this diffusion model uses attention layers on 
three levels of its feature hierarchy, the BigGAN [3] residual 
block for up- and downsampling and has 387M parameters 


Figure 11 (panels: input, result). Qualitative results on object
removal with our big, w/ ft inpainting model. For more results, see Fig. 22.


instead of 215M. After training, we noticed a discrepancy
in the quality of samples produced at resolutions $256^2$ and
$512^2$, which we hypothesize to be caused by the additional
attention modules. However, fine-tuning the model for half
an epoch at resolution $512^2$ allows the model to adjust to
the new feature statistics and sets a new state-of-the-art FID
on image inpainting (big, w/o attn, w/ ft in Tab. 7, Fig. 11).


5. Limitations & Societal Impact 


Limitations While LDMs significantly reduce computa- 
tional requirements compared to pixel-based approaches, 
their sequential sampling process is still slower than that 
of GANs. Moreover, the use of LDMs can be question- 
able when high precision is required: although the loss of 
image quality is very small in our f = 4 autoencoding mod- 
els (see Fig. 1), their reconstruction capability can become 
a bottleneck for tasks that require fine-grained accuracy in 
pixel space. We assume that our superresolution models 
(Sec. 4.4) are already somewhat limited in this respect. 


Societal Impact Generative models for media like im- 
agery are a double-edged sword: On the one hand, they 


                            40-50% masked          All samples
Method                      FID↓    LPIPS↓         FID↓   LPIPS↓
LDM-4 (ours, big, w/ ft)    9.39    0.246±0.042    1.50   0.137±0.080
LDM-4 (ours, big, w/o ft)   12.89   0.257±0.047    2.40   0.142±0.085
LDM-4 (ours, w/ attn)       11.87   0.257±0.042    2.15   0.144±0.084
LDM-4 (ours, w/o attn)      12.60   0.259±0.041    2.37   0.145±0.084
LaMa [88]†                  12.31   0.243±0.038    2.23   0.134±0.080
LaMa [88]                   12.0    0.24           2.21   0.14
CoModGAN [107]              10.4    0.26           1.82   0.15
RegionWise [52]             21.3    0.27           4.75   0.15
DeepFill v2 [104]           22.1    0.28           5.20   0.16
EdgeConnect [58]            30.5    0.28           8.37   0.16


Table 7. Comparison of inpainting performance on 30k crops of
size 512 × 512 from test images of Places [108]. The column 40-50%
reports metrics computed over hard examples where 40-50%
of the image region have to be inpainted. †: recomputed on our test
set, since the original test set used in [88] was not available.


enable various creative applications, and in particular ap- 
proaches like ours that reduce the cost of training and in- 
ference have the potential to facilitate access to this tech- 
nology and democratize its exploration. On the other hand, 
it also means that it becomes easier to create and dissemi- 
nate manipulated data or spread misinformation and spam. 
In particular, the deliberate manipulation of images (“deep
fakes”) is a common problem in this context, and women in
particular are disproportionately affected by it [13, 24].

Generative models can also reveal their training data 
[5, 90], which is of great concern when the data contain 
sensitive or personal information and were collected with- 
out explicit consent. However, the extent to which this also 
applies to DMs of images is not yet fully understood. 

Finally, deep learning modules tend to reproduce or ex- 
acerbate biases that are already present in the data [22, 38, 
91]. While diffusion models achieve better coverage of the 
data distribution than e.g. GAN-based approaches, the ex- 
tent to which our two-stage approach that combines adver- 
sarial training and a likelihood-based objective misrepre- 
sents the data remains an important research question. 

For a more general, detailed discussion of the ethical 
considerations of deep generative models, see e.g. [13]. 


6. Conclusion 


We have presented latent diffusion models, a simple and 
efficient way to significantly improve both the training and 
sampling efficiency of denoising diffusion models with- 
out degrading their quality. Based on this and our cross- 
attention conditioning mechanism, our experiments could 
demonstrate favorable results compared to state-of-the-art 
methods across a wide range of conditional image synthesis 
tasks without task-specific architectures. 


This work has been supported by the German Federal Ministry for
Economic Affairs and Energy within the project ’KI-Absicherung - Safe
AI for automated driving’ and by the German Research Foundation (DFG)
project 421703927.

Appendix


Figure 12. Convolutional samples from the semantic landscapes model as in Sec. 4.3.2, finetuned on $512^2$ images.




Figure 13 (sample prompts: ’A painting of the last supper by Picasso.’; ’An epic painting of Gandalf the Black summoning
thunder and lightning in the mountains.’; ’An oil painting of a latent space.’; ’A sunset over a mountain range, vector image.’).
Combining classifier free diffusion guidance with the convolutional sampling strategy from Sec. 4.3.2, our 1.45B parameter
text-to-image model can be used for rendering images larger than the native $256^2$ resolution the model was trained on.

A. Changelog


Here we list changes between this version (https://arxiv.org/abs/2112.10752v2) of the paper and the
previous version, i.e. https://arxiv.org/abs/2112.10752v1.


• We updated the results on text-to-image synthesis in Sec. 4.3 which were obtained by training a new, larger model (1.45B
parameters). This also includes a new comparison to very recent competing methods on this task that were published on
arXiv at the same time as ([59, 109]) or after ([26]) the publication of our work.


• We updated results on class-conditional synthesis on ImageNet in Sec. 4.1, Tab. 3 (see also Sec. D.4) obtained by
retraining the model with a larger batch size. The corresponding qualitative results in Fig. 26 and Fig. 27 were also
updated. Both the updated text-to-image and the class-conditional model now use classifier-free guidance [32] as a
measure to increase visual fidelity.


• We conducted a user study (following the scheme suggested by Saharia et al. [72]) which provides additional evaluation
for our inpainting (Sec. 4.5) and superresolution models (Sec. 4.4).


• Added Fig. 5 to the main paper, moved Fig. 18 to the appendix, added Fig. 13 to the appendix.

B. Detailed Information on Denoising Diffusion Models 


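Diffusion models can be specified in terms of a signal-to-noise ratio $\mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2}$ consisting of sequences $(\alpha_t)_{t=1}^T$ and
$(\sigma_t)_{t=1}^T$ which, starting from a data sample $x_0$, define a forward diffusion process $q$ as

$$q(x_t \mid x_0) = \mathcal{N}(x_t \mid \alpha_t x_0, \sigma_t^2 \mathbb{I}) \tag{4}$$

with the Markov structure for $s < t$:

$$q(x_t \mid x_s) = \mathcal{N}(x_t \mid \alpha_{t|s} x_s, \sigma_{t|s}^2 \mathbb{I}) \tag{5}$$

$$\alpha_{t|s} = \frac{\alpha_t}{\alpha_s} \tag{6}$$

$$\sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2 \tag{7}$$

Denoising diffusion models are generative models $p(x_0)$ which revert this process with a similar Markov structure running
backward in time, i.e. they are specified as

$$p(x_0) = \int_z p(x_T) \prod_{t=1}^{T} p(x_{t-1} \mid x_t) \tag{8}$$

The evidence lower bound (ELBO) associated with this model then decomposes over the discrete time steps as

$$-\log p(x_0) \le \mathrm{KL}\big(q(x_T \mid x_0) \,|\, p(x_T)\big) + \sum_{t=1}^{T} \mathbb{E}_{q(x_t \mid x_0)} \mathrm{KL}\big(q(x_{t-1} \mid x_t, x_0) \,|\, p(x_{t-1} \mid x_t)\big) \tag{9}$$

The prior $p(x_T)$ is typically chosen as a standard normal distribution and the first term of the ELBO then depends only on
the final signal-to-noise ratio $\mathrm{SNR}(T)$. To minimize the remaining terms, a common choice to parameterize $p(x_{t-1} \mid x_t)$ is to
specify it in terms of the true posterior $q(x_{t-1} \mid x_t, x_0)$ but with the unknown $x_0$ replaced by an estimate $x_\theta(x_t, t)$ based on
the current step $x_t$. This gives [45]

$$p(x_{t-1} \mid x_t) := q(x_{t-1} \mid x_t, x_\theta(x_t, t)) \tag{10}$$

$$= \mathcal{N}\Big(x_{t-1} \,\Big|\, \mu_\theta(x_t, t), \frac{\sigma_{t|t-1}^2 \sigma_{t-1}^2}{\sigma_t^2} \mathbb{I}\Big), \tag{11}$$

where the mean can be expressed as

$$\mu_\theta(x_t, t) = \frac{\alpha_{t|t-1} \sigma_{t-1}^2}{\sigma_t^2} x_t + \frac{\alpha_{t-1} \sigma_{t|t-1}^2}{\sigma_t^2} x_\theta(x_t, t). \tag{12}$$

In this case, the sum of the ELBO simplifies to

$$\sum_{t=1}^{T} \mathbb{E}_{q(x_t \mid x_0)} \mathrm{KL}\big(q(x_{t-1} \mid x_t, x_0) \,|\, p(x_{t-1} \mid x_t)\big) = \sum_{t=1}^{T} \mathbb{E}_{\mathcal{N}(\epsilon \mid 0, \mathbb{I})} \frac{1}{2} \big(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\big) \|x_0 - x_\theta(\alpha_t x_0 + \sigma_t \epsilon, t)\|^2 \tag{13}$$

Following [30], we use the reparameterization

$$\epsilon_\theta(x_t, t) = (x_t - \alpha_t x_\theta(x_t, t)) / \sigma_t \tag{14}$$

to express the reconstruction term as a denoising objective,

$$\|x_0 - x_\theta(\alpha_t x_0 + \sigma_t \epsilon, t)\|^2 = \frac{\sigma_t^2}{\alpha_t^2} \|\epsilon - \epsilon_\theta(\alpha_t x_0 + \sigma_t \epsilon, t)\|^2 \tag{15}$$

and the reweighting, which assigns each of the terms the same weight and results in Eq. (1).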
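
The relations of this section can be checked numerically. The sketch below is illustrative only: the function names, the variance-preserving example values for $\alpha$ and $\sigma$, and the perturbed noise that stands in for a network prediction are assumptions. It samples from the forward process, evaluates the Markov transition parameters of Eqs. (6) and (7), and verifies the identity between the reconstruction and the denoising objective.

```python
import numpy as np

def snr(alpha_t, sigma_t):
    """Signal-to-noise ratio alpha_t^2 / sigma_t^2."""
    return alpha_t ** 2 / sigma_t ** 2

def q_sample(x0, alpha_t, sigma_t, eps):
    """Sample x_t from N(alpha_t * x0, sigma_t^2 * I) via reparameterization."""
    return alpha_t * x0 + sigma_t * eps

def transition_params(alpha_s, sigma_s, alpha_t, sigma_t):
    """Parameters alpha_{t|s} and sigma_{t|s}^2 of q(x_t | x_s) for s < t."""
    alpha_ts = alpha_t / alpha_s
    var_ts = sigma_t ** 2 - alpha_ts ** 2 * sigma_s ** 2
    return alpha_ts, var_ts

def x0_from_eps(x_t, eps_hat, alpha_t, sigma_t):
    """Solve the reparameterization for x_theta: (x_t - sigma_t * eps_theta) / alpha_t."""
    return (x_t - sigma_t * eps_hat) / alpha_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)

# Example values with alpha^2 + sigma^2 = 1 (an assumption of this sketch).
alpha_s, sigma_s = 0.9, np.sqrt(1.0 - 0.9 ** 2)
alpha_t, sigma_t = 0.8, 0.6

alpha_ts, var_ts = transition_params(alpha_s, sigma_s, alpha_t, sigma_t)

x_t = q_sample(x0, alpha_t, sigma_t, eps)
eps_hat = eps + 0.1 * rng.standard_normal(16)   # stand-in for a network prediction
x0_hat = x0_from_eps(x_t, eps_hat, alpha_t, sigma_t)

lhs = np.sum((x0 - x0_hat) ** 2)
rhs = (sigma_t ** 2 / alpha_t ** 2) * np.sum((eps - eps_hat) ** 2)
```

Both sides agree up to floating-point error, which is the reason the reconstruction loss can be rewritten as a noise-prediction loss with a time-dependent weight.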


C. Image Guiding Mechanisms 


Figure 14 (panels: Samples $256^2$; Guided Convolutional Samples $512^2$; Convolutional Samples $512^2$). On landscapes,
convolutional sampling with unconditional models can lead to homogeneous and incoherent global structures
(see column 2). L2-guiding with a low resolution image can help to reestablish coherent global structures.


An intriguing feature of diffusion models is that unconditional models can be conditioned at test-time [15, 82, 85]. In
particular, [15] presented an algorithm to guide both unconditional and conditional models trained on the ImageNet dataset
with a classifier $\log p_\Phi(y \mid x_t)$, trained on each $x_t$ of the diffusion process. We directly build on this formulation and introduce
post-hoc image-guiding:

For an epsilon-parameterized model with fixed variance, the guiding algorithm as introduced in [15] reads:

$$\hat{\epsilon} \leftarrow \epsilon_\theta(z_t, t) + \sqrt{1 - \alpha_t^2}\, \nabla_{z_t} \log p_\Phi(y \mid z_t). \tag{16}$$

This can be interpreted as an update correcting the “score” $\epsilon_\theta$ with a conditional distribution $\log p_\Phi(y \mid z_t)$.

So far, this scenario has only been applied to single-class classification models. We re-interpret the guiding distribution
$p_\Phi(y \mid T(\mathcal{D}(z_0(z_t))))$ as a general purpose image-to-image translation task given a target image $y$, where $T$ can be any
differentiable transformation adapted to the image-to-image translation task at hand, such as the identity, a downsampling
operation or similar.
As an example, we can assume a Gaussian guider with fixed variance $\sigma^2 = 1$, such that

$$\log p_\Phi(y \mid z_t) = -\frac{1}{2} \| y - T(\mathcal{D}(z_0(z_t))) \|_2^2 \tag{17}$$

becomes an $L_2$ regression objective.
Fig. 14 demonstrates how this formulation can serve as an upsampling mechanism of an unconditional model trained on
$256^2$ images, where unconditional samples of size $256^2$ guide the convolutional synthesis of $512^2$ images and $T$ is a 2×
bicubic downsampling. Following this motivation, we also experiment with a perceptual similarity guiding and replace the
$L_2$ objective with the LPIPS [106] metric, see Sec. 4.4.
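
For the Gaussian guider, the gradient needed in the guiding update has a closed form whenever the transformation is linear. The sketch below makes several simplifying assumptions that are not part of the method above: it swaps the bicubic downsampling for 2× average pooling, works on a single-channel image, and differentiates with respect to the decoded estimate only, i.e. the decoder and the chain rule through the latent are omitted. All function names are hypothetical.

```python
import numpy as np

def avg_pool2(x):
    """2x average pooling of an (H, W) array with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def avg_pool2_transpose(g):
    """Adjoint of avg_pool2: spread each value over its 2x2 block with weight 1/4."""
    return np.repeat(np.repeat(g, 2, axis=0), 2, axis=1) / 4.0

def log_p_guider(x, y):
    """Unit-variance Gaussian guider up to a constant: -0.5 * ||y - T(x)||^2."""
    return -0.5 * np.sum((y - avg_pool2(x)) ** 2)

def grad_log_p_guider(x, y):
    """Analytic gradient of log_p_guider with respect to x: T^T (y - T x)."""
    return avg_pool2_transpose(y - avg_pool2(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))   # stand-in for the decoded high-resolution estimate
y = rng.standard_normal((2, 2))   # low-resolution target image
grad = grad_log_p_guider(x, y)

# Central finite difference at one pixel as a sanity check of the analytic gradient.
h = 1e-6
e = np.zeros_like(x)
e[1, 2] = 1.0
fd = (log_p_guider(x + h * e, y) - log_p_guider(x - h * e, y)) / (2 * h)
```

In a full implementation this gradient would be propagated through the decoder and the estimate of the clean latent by automatic differentiation before being added to the noise prediction.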
D. Additional Results
D.1. Choosing the Signal-to-Noise Ratio for High-Resolution Synthesis 


Figure 15 (panels: KL-reg, w/o rescaling; KL-reg, w/ rescaling; VQ-reg, w/o rescaling). Illustrating the effect of latent space
rescaling on convolutional sampling, here for semantic image synthesis on landscapes. See Sec. 4.3.2 and Sec. D.1.


As discussed in Sec. 4.3.2, the signal-to-noise ratio induced by the variance of the latent space (i.e. $\mathrm{Var}(z)/\sigma_t^2$) significantly
affects the results for convolutional sampling. For example, when training an LDM directly in the latent space of a KL-regularized
model (see Tab. 8), this ratio is very high, such that the model allocates a lot of semantic detail early on in the
reverse denoising process. In contrast, when rescaling the latent space by the component-wise standard deviation of the
latents as described in Sec. G, the SNR is decreased. We illustrate the effect on convolutional sampling for semantic image
synthesis in Fig. 15. Note that the VQ-regularized space has a variance close to 1, such that it does not have to be rescaled.
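
The rescaling can be sketched as follows. This is an assumption-laden illustration rather than the procedure of Sec. G: it estimates a single scalar standard deviation over all components of one batch of latents, and the batch itself is synthetic with an arbitrarily chosen scale of 5.

```python
import numpy as np

def rescale_latents(z):
    """Divide latents by a scalar standard deviation estimated over all components."""
    std = z.std()
    return z / std, std

rng = np.random.default_rng(0)
z = 5.0 * rng.standard_normal((8, 3, 64, 64))   # synthetic high-variance latents
z_scaled, std = rescale_latents(z)

# For a fixed noise level sigma_t, the ratio Var(z) / sigma_t^2 shrinks by std^2.
sigma_t = 0.5
snr_before = z.var() / sigma_t ** 2
snr_after = z_scaled.var() / sigma_t ** 2
```

After rescaling, the latents have unit variance, so the same noise schedule destroys the signal earlier in the forward process than it would for the unscaled latents.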


D.2. Full List of all First Stage Models 


We provide a complete list of various autoencoding models trained on the OpenImages dataset in Tab. 8.


D.3. Layout-to-Image Synthesis 


Here we provide the quantitative evaluation and additional samples for our layout-to-image models from Sec. 4.3.1. We
train a model on the COCO [4] and one on the OpenImages [49] dataset, which we subsequently additionally finetune on
COCO. Tab. 9 shows the result. Our COCO model reaches the performance of recent state-of-the-art models in layout-to-image
synthesis, when following their training and evaluation protocol [89]. When finetuning from the OpenImages model,
we surpass these works. Our OpenImages model surpasses the results of Jahn et al. [37] by a margin of nearly 11 in terms of
FID. In Fig. 16 we show additional samples of the model finetuned on COCO.


D.4. Class-Conditional Image Synthesis on ImageNet 


Tab. 10 contains the results for our class-conditional LDM measured in FID and Inception score (IS). LDM-8 requires
significantly fewer parameters and compute requirements (see Tab. 18) to achieve very competitive performance. Similar
to previous work, we can further boost the performance by training a classifier on each noise scale and guiding with it,
see Sec. C. Unlike the pixel-based methods, this classifier is trained very cheaply in latent space. For additional qualitative
results, see Fig. 26 and Fig. 27.


f               |Z|     c     R-FID↓   R-IS↑          PSNR↑        PSIM↓       SSIM↑
16 VQGAN [23]   16384   256   4.98     -              19.9±3.4     1.83±0.42   0.51±0.18
16 VQGAN [23]   1024    256   7.94     -              19.4±3.3     1.98±0.43   0.50±0.18
8 DALL-E [66]   8192    -     32.01    -              22.8±2.1     1.95±0.51   0.73±0.13
32              16384   16    31.83    40.40±1.07     17.45±2.90   2.58±0.48   0.41±0.18
16              16384   8     5.15     144.55±3.74    20.83±3.61   1.73±0.43   0.54±0.18
8               16384   4     1.14     201.92±3.97    23.07±3.99   1.17±0.36   0.65±0.16
8               256     4     1.49     194.20±3.87    22.35±3.81   1.26±0.37   0.62±0.16
4               8192    3     0.58     224.78±5.35    27.43±4.26   0.53±0.21   0.82±0.10
4†              8192    3     1.06     221.94±4.58    25.21±4.17   0.72±0.26   0.76±0.12
4               256     3     0.47     223.81±4.58    26.43±4.22   0.62±0.24   0.80±0.11
2               2048    2     0.16     232.75±5.09    30.85±4.12   0.27±0.12   0.91±0.05
2               64      2     0.40     226.62±4.83    29.13±3.46   0.38±0.13   0.90±0.05
32              KL      64    2.04     189.53±3.68    22.27±3.93   1.41±0.40   0.61±0.17
32              KL      16    7.3      132.75±2.71    20.38±3.56   1.88±0.45   0.53±0.18
16              KL      16    0.87     210.31±3.97    24.08±4.22   1.07±0.36   0.68±0.15
16              KL      8     2.63     178.68±4.08    21.94±3.92   1.49±0.42   0.59±0.17
8               KL      4     0.90     209.90±4.92    24.19±4.19   1.02±0.35   0.69±0.15
4               KL      3     0.27     227.57±4.89    27.53±4.54   0.55±0.24   0.82±0.11
2               KL      2     0.086    232.66±5.16    32.47±4.19   0.20±0.09   0.93±0.04


Table 8. Complete autoencoder zoo trained on OpenImages, evaluated on ImageNet-Val. † denotes an attention-free autoencoder.


Figure 16 (panel title: layout-to-image synthesis on the COCO dataset). More samples from our best model for layout-to-image
synthesis, LDM-4, which was trained on the OpenImages dataset and finetuned on the COCO dataset. Samples generated with
100 DDIM steps and $\eta = 0$. Layouts are from the COCO validation set.


                          COCO 256 × 256   OpenImages 256 × 256   OpenImages 512 × 512
Method                    FID↓             FID↓                   FID↓
LostGAN-V2 [87]           42.55            -                      -
OC-GAN [89]               41.65            -                      -
SPADE [62]                41.11            -                      -
VQGAN+T [37]              56.58            45.33                  48.11
LDM-8 (100 steps, ours)   42.06†           -                      -
LDM-4 (200 steps, ours)   40.91*           32.02                  35.80


Table 9. Quantitative comparison of our layout-to-image models on the COCO [4] and OpenImages [49] datasets. †: Training from scratch
on COCO; *: Finetuning from OpenImages.


Method              FID↓    IS↑           Precision↑  Recall↑  N_params
SR3 [72]            11.30   -             -           -        625M      -
ImageBART [21]      21.19   -             -           -        3.5B      -
ImageBART [21]      7.44    -             -           -        3.5B      0.05 acc. rate*
VQGAN+T [23]        17.04   70.6±1.8      -           -        1.3B      -
VQGAN+T [23]        5.88    304.8±3.6     -           -        1.3B      0.05 acc. rate*
BigGan-deep [3]     6.95    203.6±2.6     0.87        0.28     340M      -
ADM [15]            10.94   100.98        0.69        0.63     554M      250 DDIM steps
ADM-G [15]          4.59    186.7         0.82        0.52     608M      250 DDIM steps
ADM-G,ADM-U [15]    3.85    221.72        0.84        0.53     n/a       2 × 250 DDIM steps
CDM [31]            4.88    158.71±2.26   -           -        n/a       2 × 100 DDIM steps
LDM-8 (ours)        17.41   72.92±2.6     0.65        0.62     395M      200 DDIM steps, 2.9M train steps, batch size 64
LDM-8-G (ours)      8.11    190.43±2.60   0.83        0.36     506M      200 DDIM steps, classifier scale 10, 2.9M train steps, batch size 64
LDM-8 (ours)        15.51   79.03±1.03    0.65        0.63     395M      200 DDIM steps, 4.8M train steps, batch size 64
LDM-8-G (ours)      7.76    209.52±4.24   0.84        0.35     506M      200 DDIM steps, classifier scale 10, 4.8M train steps, batch size 64
LDM-4 (ours)        10.56   103.49±1.24   0.71        0.62     400M      250 DDIM steps, 178K train steps, batch size 1200
LDM-4-G (ours)      3.95    178.22±2.43   0.81        0.55     400M      250 DDIM steps, unconditional guidance [32] scale 1.25, 178K train steps, batch size 1200
LDM-4-G (ours)      3.60    247.67±5.59   0.87        0.48     400M      250 DDIM steps, unconditional guidance [32] scale 1.5, 178K train steps, batch size 1200


Table 10. Comparison of a class-conditional ImageNet LDM with recent state-of-the-art methods for class-conditional image generation
on the ImageNet [12] dataset. *: Classifier rejection sampling with the given rejection rate as proposed in [67].


D.5. Sample Quality vs. V100 Days (Continued from Sec. 4.1) 


Figure 17 (plot: FID vs. V100 days). For completeness we also report the training progress of class-conditional LDMs on the ImageNet dataset for a fixed number
of 35 V100 days. Results obtained with 100 DDIM steps [84] and $\kappa = 0$. FIDs computed on 5000 samples for efficiency reasons.


For the assessment of sample quality over the training progress in Sec. 4.1, we reported FID and IS scores as a function
of train steps. Another possibility is to report these metrics over the used resources in V100 days. Such an analysis is
additionally provided in Fig. 17, showing qualitatively similar results.


Method                             FID↓        IS↑           PSNR↑      SSIM↑
Image Regression [72]              15.2        121.1         27.9       0.801
SR3 [72]                           5.2         180.1         26.4       0.762
LDM-4 (ours, 100 steps)            2.8†/4.8‡   166.3         24.4±3.8   0.69±0.14
LDM-4 (ours, 50 steps, guiding)    4.4†/6.4‡   153.7         25.8±3.7   0.74±0.12
LDM-4 (ours, 100 steps, guiding)   4.4†/6.4‡   154.1         25.7±3.7   0.73±0.12
LDM-4 (ours, 100 steps, +15 ep.)   2.6†/4.6‡   169.76±5.03   24.4±3.8   0.69±0.14
Pixel-DM (100 steps, +15 ep.)      5.1†/7.1‡   163.06±4.67   24.1±3.3   0.59±0.12


Table 11. ×4 upscaling results on ImageNet-Val. ($256^2$); †: FID features computed on validation split, ‡: FID features computed on train
split. We also include a pixel-space baseline that receives the same amount of compute as LDM-4. The last two rows received 15 epochs
of additional training compared to the former results.


D.6. Super-Resolution 


For better comparability between LDMs and diffusion models in pixel space, we extend our analysis from Tab. 5 by
comparing a diffusion model trained for the same number of steps and with a comparable number¹ of parameters to our
LDM. The results of this comparison are shown in the last two rows of Tab. 11 and demonstrate that LDM achieves better
performance while allowing for significantly faster sampling. A qualitative comparison is given in Fig. 20 which shows
random samples from both LDM and the diffusion model in pixel space.


D.6.1 LDM-BSR: General Purpose SR Model via Diverse Image Degradation 


Figure 18 (panels: bicubic, LDM-SR, LDM-BSR). LDM-BSR generalizes to arbitrary inputs and can be used as a general-purpose upsampler,
upscaling samples from a class-conditional LDM (image cf. Fig. 4) to $1024^2$ resolution. In contrast, using a fixed degradation
process (see Sec. 4.4) hinders generalization.


To evaluate generalization of our LDM-SR, we apply it both on synthetic LDM samples from a class-conditional ImageNet
model (Sec. 4.1) and images crawled from the internet. Interestingly, we observe that LDM-SR, trained only with a bicubically
downsampled conditioning as in [72], does not generalize well to images which do not follow this pre-processing. Hence, to
obtain a superresolution model for a wide range of real world images, which can contain complex superpositions of camera
noise, compression artifacts, blur and interpolations, we replace the bicubic downsampling operation in LDM-SR with the
degradation pipeline from [105]. The BSR-degradation process is a degradation pipeline which applies JPEG compression
noise, camera sensor noise, different image interpolations for downsampling, Gaussian blur kernels and Gaussian noise in a
random order to an image. We found that using the BSR-degradation process with the original parameters as in [105] leads to
a very strong degradation process. Since a more moderate degradation process seemed appropriate for our application, we
adapted the parameters of the BSR-degradation (our adapted degradation process can be found in our code base at
https://github.com/CompVis/latent-diffusion). Fig. 18 illustrates the effectiveness of this approach by directly
comparing LDM-SR with LDM-BSR. The latter produces images much sharper than the models confined to a fixed pre-processing,
making it suitable for real-world applications. Further results of LDM-BSR are shown on LSUN-cows in Fig. 19.
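
The idea of applying several degradations in a random order can be sketched as follows. This is a toy illustration and not the pipeline of [105] or the adapted version in the code base: it uses only three simple operations on a single-channel image, a box blur stands in for the Gaussian blur kernels, JPEG and sensor noise are left out, and all function names and parameter values are hypothetical.

```python
import numpy as np

def add_gaussian_noise(img, rng, sigma=0.02):
    """Additive Gaussian noise, clipped back to the valid range [0, 1]."""
    return np.clip(img + sigma * rng.standard_normal(img.shape), 0.0, 1.0)

def box_blur(img, rng):
    """3x3 box blur with edge padding (stand-in for a Gaussian blur kernel)."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros_like(img)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def downsample2(img, rng):
    """2x downsampling by striding (stand-in for an interpolation-based resize)."""
    return img[::2, ::2]

def degrade(img, rng):
    """Apply all degradations once, in a randomly permuted order."""
    ops = [add_gaussian_noise, box_blur, downsample2]
    for idx in rng.permutation(len(ops)):
        img = ops[idx](img, rng)
    return img

rng = np.random.default_rng(0)
high_res = rng.random((32, 32))      # synthetic image with values in [0, 1)
low_res = degrade(high_res, rng)     # degraded conditioning image
```

Drawing a fresh permutation for every training image exposes the model to many different combinations of artifacts, which is the property that a single fixed bicubic degradation lacks.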


¹It is not possible to exactly match both architectures since the diffusion model operates in the pixel space.




E. Implementation Details and Hyperparameters 


E.1. Hyperparameters 
We provide an overview of the hyperparameters of all trained LDM models in Tab. 12, Tab. 13, Tab. 14 and Tab. 15. 


CelebA-HQ 256 x 256 FFHQ 256 x 256 LSUN-Churches 256 x 256 LSUN-Bedrooms 256 x 256 


f 4 4 8 4 
z-shape 64 x 64 x 3 64 x 64 x 3 - 64 x 64 x 3 
|Z| 8192 8192 - 8192 
Diffusion steps 1000 1000 1000 1000 
Noise Schedule linear linear linear linear 
Nparams 274M 274M 294M 274M 
Channels 224 224 192 224 
Depth 2 2 2 2 
Channel Multiplier 1,2,3,4 1,2,3,4 1,2,2,4,4 1,2,3,4 
Attention resolutions 32, 16,8 32, 16,8 32, 16, 8,4 32, 16,8 
Head Channels 32 32 24 32 
Batch Size 48 42 96 48 
Iterations* 410k 635k 500k 1.9M 
Learning Rate 9.6e-5 8.4e-5 5.e-5 9.6e-5 


Table 12. Hyperparameters for the unconditional LDMs producing the numbers shown in Tab. 1. All models trained on a single NVIDIA
A100. 


LDM-1 LDM-2 LDM-4 LDM-8 LDM-16 LDM-32 
z-shape 256 x 256 x 3 128 x 128 x 2 64 x 64 x 3 32 x 32 x 4 16 x 16 x 8 8 x 8 x 32
|Z| - 2048 8192 16384 16384 16384 
Diffusion steps 1000 1000 1000 1000 1000 1000 
Noise Schedule linear linear linear linear linear linear 
Model Size 396M 391M 391M 395M 395M 395M 
Channels 192 192 192 256 256 256 
Depth 2 2 2 2 2 2 
Channel Multiplier 1,1,2,2,4,4 1,2,2,4,4 1,2,3,5 1,2,4 1,2,4 1,2,4 
Number of Heads 1 1 1 1 1 1 
Batch Size 7 9 40 64 112 112
Iterations 2M 2M 2M 2M 2M 2M 
Learning Rate 4.9e-5 6.3e-5 8e-5 6.4e-5 4.5e-5 4.5e-5 
Conditioning CA CA CA CA CA CA 
CA-resolutions 32, 16, 8 32, 16, 8 32, 16, 8 32, 16, 8 16, 8, 4 8, 4,2 
Embedding Dimension 512 512 512 512 512 512 
Transformers Depth 1 1 1 1 1 1 


Table 13. Hyperparameters for the conditional LDMs trained on the ImageNet dataset for the analysis in Sec. 4.1. All models trained on a 
single NVIDIA A100. 


E.2. Implementation Details 
E.2.1 Implementations of $\tau_\theta$ for conditional LDMs


For the experiments on text-to-image and layout-to-image (Sec. 4.3.1) synthesis, we implement the conditioner $\tau_\theta$ as an
unmasked transformer which processes a tokenized version of the input $y$ and produces an output $\zeta := \tau_\theta(y)$, where
$\zeta \in \mathbb{R}^{M \times d_\tau}$. More specifically, the transformer is implemented from $N$ transformer blocks consisting of global self-attention
layers, layer-normalization and position-wise MLPs as follows²:


²adapted from https://github.com/lucidrains/x-transformers

LDM-1 LDM-2 LDM-4 LDM-8 LDM-16 LDM-32
z-shape 256 x 256 x 3 128 x 128 x 2 64 x 64 x 3 32 x 32 x 4 16 x 16 x 8 8 x 8 x 32
|Z| - 2048 8192 16384 16384 16384 
Diffusion steps 1000 1000 1000 1000 1000 1000 
Noise Schedule linear linear linear linear linear linear 
Model Size 270M 265M 274M 258M 260M 258M 
Channels 192 192 224 256 256 256 
Depth 2 2 2 2 2 2 
Channel Multiplier 1,1,2,2,4,4 1,2,2,4,4 1,2,3,4 1,2,4 1,2,4 1,2,4 
Attention resolutions 32, 16,8 32, 16,8 32, 16,8 32, 16,8 16, 8, 4 8,4, 2 
Head Channels 32 32 32 32 32 32 
Batch Size 9 11 48 96 128 128 
Iterations* 500k 500k 500k 500k 500k 500k 
Learning Rate 9e-5 1.1e-4 9.6e-5 9.6e-5 1.3e-4 1.3e-4 


Table 14. Hyperparameters for the unconditional LDMs trained on the CelebA dataset for the analysis in Fig. 7. All models trained on a
single NVIDIA A100. *: All models are trained for 500k iterations. If converging earlier, we used the best checkpoint for assessing the
provided FID scores.


| Task | Text-to-Image | Layout-to-Image | Layout-to-Image | Class-Label-to-Image | Super Resolution | Inpainting | Semantic-Map-to-Image |
|---|---|---|---|---|---|---|---|
| Dataset | LAION | OpenImages | COCO | ImageNet | ImageNet | Places | Landscapes |
| f | 8 | 4 | 8 | 4 | 4 | 4 | 8 |
| z-shape | 32 × 32 × 4 | 64 × 64 × 3 | 32 × 32 × 4 | 64 × 64 × 3 | 64 × 64 × 3 | 64 × 64 × 3 | 32 × 32 × 4 |
| $\lvert\mathcal{Z}\rvert$ | - | 8192 | 16384 | 8192 | 8192 | 8192 | 16384 |
| Diffusion steps | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| Noise Schedule | linear | linear | linear | linear | linear | linear | linear |
| Model Size | 1.45B | 306M | 345M | 395M | 169M | 215M | 215M |
| Channels | 320 | 128 | 192 | 192 | 160 | 128 | 128 |
| Depth | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Channel Multiplier | 1,2,4,4 | 1,2,3,4 | 1,2,4 | 1,2,3,5 | 1,2,2,4 | 1,4,8 | 1,4,8 |
| Number of Heads | 8 | 1 | 1 | 1 | 1 | 1 | 1 |
| Dropout | - | - | 0.1 | - | - | - | - |
| Batch Size | 680 | 24 | 48 | 1200 | 64 | 128 | 48 |
| Iterations | 390K | 4.4M | 170K | 178K | 860K | 360K | 360K |
| Learning Rate | 1.0e-4 | 4.8e-5 | 4.8e-5 | 1.0e-4 | 6.4e-5 | 1.0e-6 | 4.8e-5 |
| Conditioning | CA | CA | CA | CA | concat | concat | concat |
| (C)A-resolutions | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | - | - | - |
| Embedding Dimension | 1280 | 512 | 512 | 512 | - | - | - |
| Transformer Depth | 1 | 3 | 2 | 1 | - | - | - |

Table 15. Hyperparameters for the conditional LDMs from Sec. 4. All models trained on a single NVIDIA A100 except for the inpainting
model which was trained on eight V100.


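\begin{align}
\zeta &\leftarrow \mathrm{TokEmb}(y) + \mathrm{PosEmb}(y) \tag{18}\\
\text{for } & i = 1, \ldots, N: \tag{19}\\
\zeta_1 &\leftarrow \mathrm{LayerNorm}(\zeta) \tag{20}\\
\zeta_2 &\leftarrow \mathrm{MultiHeadSelfAttention}(\zeta_1) + \zeta \tag{21}\\
\zeta_3 &\leftarrow \mathrm{LayerNorm}(\zeta_2) \tag{22}\\
\zeta &\leftarrow \mathrm{MLP}(\zeta_3) + \zeta_2 \tag{23}\\
\zeta &\leftarrow \mathrm{LayerNorm}(\zeta) \tag{24}
\end{align}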
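
To make the update rules above concrete, the following NumPy sketch runs the same pre-normalization loop on random weights. It is a minimal illustration, not the implementation used for the experiments: the toy sizes, the random initialization, the ReLU activation in the MLP, the 4x hidden width and the omission of learned LayerNorm parameters are all assumptions made for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token vector (no learned scale/shift, for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, p, n_heads):
    # Unmasked (global) multi-head self-attention over all M tokens.
    m, d = x.shape
    dh = d // n_heads

    def split(t):
        return t.reshape(m, n_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, M, M)
    return (att @ v).transpose(1, 0, 2).reshape(m, d) @ p["wo"]


def mlp(x, p):
    # Position-wise MLP; ReLU is an assumption, the text names no activation.
    return np.maximum(x @ p["w1"], 0.0) @ p["w2"]


def init_block(d, rng):
    def w(a, b):
        return rng.standard_normal((a, b)) / np.sqrt(a)

    return {"wq": w(d, d), "wk": w(d, d), "wv": w(d, d), "wo": w(d, d),
            "w1": w(d, 4 * d), "w2": w(4 * d, d)}


def conditioner(tokens, tok_emb, pos_emb, blocks, n_heads):
    zeta = tok_emb[tokens] + pos_emb[: len(tokens)]  # TokEmb(y) + PosEmb(y)
    for p in blocks:                                 # for i = 1, ..., N
        z1 = layer_norm(zeta)
        z2 = self_attention(z1, p, n_heads) + zeta   # residual connection
        z3 = layer_norm(z2)
        zeta = mlp(z3, p) + z2                       # residual connection
    return layer_norm(zeta)                          # final normalization


rng = np.random.default_rng(0)
vocab, M, d, N = 100, 77, 64, 2  # toy sizes, not the values of Tab. 17
tok_emb = rng.standard_normal((vocab, d))
pos_emb = rng.standard_normal((M, d))
blocks = [init_block(d, rng) for _ in range(N)]
tokens = rng.integers(0, vocab, size=M)
zeta = conditioner(tokens, tok_emb, pos_emb, blocks, n_heads=8)
print(zeta.shape)  # (77, 64)
```

The output keeps one d-dimensional vector per input token, which is the sequence the cross-attention layers of the UNet attend to.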


With $\zeta$ available, the conditioning is mapped into the UNet via the cross-attention mechanism as depicted in Fig. 3. We
modify the “ablated UNet” [15] architecture and replace the self-attention layer with a shallow (unmasked) transformer
consisting of $T$ blocks with alternating layers of (i) self-attention, (ii) a position-wise MLP and (iii) a cross-attention layer;
see Tab. 16. Note that without (ii) and (iii), this architecture is equivalent to the “ablated UNet”.

While it would be possible to increase the representational power of $\tau_\theta$ by additionally conditioning on the time step $t$, we
do not pursue this choice as it reduces the speed of inference. We leave a more detailed analysis of this modification to future
work.

For the text-to-image model, we rely on a publicly available³ tokenizer [99]. The layout-to-image model discretizes the
spatial locations of the bounding boxes and encodes each box as a $(l, b, c)$-tuple, where $l$ denotes the (discrete) top-left and $b$
the bottom-right position. Class information is contained in $c$.
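
A minimal sketch of such a box encoding is given below. The grid resolution, the row-major cell indexing and the function names are hypothetical choices for illustration; the text only states that positions are discretized and that each box becomes an $(l, b, c)$-tuple.

```python
def discretize(x, y, grid):
    """Map normalized coordinates in [0, 1] to a single grid-cell index."""
    col = min(int(x * grid), grid - 1)  # clamp so that x == 1.0 stays in range
    row = min(int(y * grid), grid - 1)
    return row * grid + col             # row-major index (assumed convention)


def encode_box(box, class_id, grid=16):
    """Encode a box (x0, y0, x1, y1) with class c as an (l, b, c) tuple."""
    x0, y0, x1, y1 = box
    l = discretize(x0, y0, grid)  # discrete top-left position
    b = discretize(x1, y1, grid)  # discrete bottom-right position
    return (l, b, class_id)


# A box covering the whole image, with a hypothetical class id 7.
print(encode_box((0.0, 0.0, 1.0, 1.0), class_id=7))  # (0, 255, 7)
```

Each tuple contributes three discrete tokens, so a layout with several boxes becomes a token sequence that the transformer conditioner can process like text.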

See Tab. 17 for the hyperparameters of $\tau_\theta$ and Tab. 13 for those of the UNet for both of the above tasks.

Note that the class-conditional model as described in Sec. 4.1 is also implemented via cross-attention, where $\tau_\theta$ is a single
learnable embedding layer with a dimensionality of 512, mapping classes $y$ to $\zeta \in \mathbb{R}^{1 \times 512}$.
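| Layer | Output shape | Repeated |
|---|---|---|
| input | $\mathbb{R}^{h \times w \times c}$ | |
| LayerNorm | $\mathbb{R}^{h \times w \times c}$ | |
| Conv1x1 | $\mathbb{R}^{h \times w \times d \cdot n_h}$ | |
| Reshape | $\mathbb{R}^{h \cdot w \times d \cdot n_h}$ | |
| SelfAttention | $\mathbb{R}^{h \cdot w \times d \cdot n_h}$ | $\times T$ |
| MLP | $\mathbb{R}^{h \cdot w \times d \cdot n_h}$ | $\times T$ |
| CrossAttention | $\mathbb{R}^{h \cdot w \times d \cdot n_h}$ | $\times T$ |
| Reshape | $\mathbb{R}^{h \times w \times d \cdot n_h}$ | |
| Conv1x1 | $\mathbb{R}^{h \times w \times c}$ | |

Table 16. Architecture of a transformer block as described in Sec. E.2.1, replacing the self-attention layer of the standard “ablated UNet”
architecture [15]. Here, $n_h$ denotes the number of attention heads and $d$ the dimensionality per head.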
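
The shape flow of Tab. 16 can be traced with a small NumPy sketch of the projection, reshape and cross-attention steps. It uses standard scaled dot-product attention with queries from the image features and keys/values from the conditioning; the toy sizes, the random weights and the omission of LayerNorm, self-attention, MLP and residual connections are simplifying assumptions.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attention(feat, zeta, wq, wk, wv, n_heads):
    """feat: (h*w, d*n_h) flattened UNet features; zeta: (M, d_tau) tokens."""
    n, inner = feat.shape
    d = inner // n_heads

    def heads(t):
        return t.reshape(t.shape[0], n_heads, d).transpose(1, 0, 2)

    q = heads(feat @ wq)  # queries come from the image features
    k = heads(zeta @ wk)  # keys and values come from the conditioning
    v = heads(zeta @ wv)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (n_h, h*w, M)
    return (att @ v).transpose(1, 0, 2).reshape(n, inner)


rng = np.random.default_rng(0)
h, w, c = 8, 8, 32   # toy feature map
n_heads, d = 4, 16   # heads n_h and per-head dimension d
M, d_tau = 77, 48    # conditioning sequence length and width
inner = n_heads * d

x = rng.standard_normal((h, w, c))
zeta = rng.standard_normal((M, d_tau))
proj_in = rng.standard_normal((c, inner)) / np.sqrt(c)  # 1x1 conv as per-pixel linear map
proj_out = rng.standard_normal((inner, c)) / np.sqrt(inner)
wq = rng.standard_normal((inner, inner)) / np.sqrt(inner)
wk = rng.standard_normal((d_tau, inner)) / np.sqrt(d_tau)
wv = rng.standard_normal((d_tau, inner)) / np.sqrt(d_tau)

feat = (x @ proj_in).reshape(h * w, inner)               # Conv1x1, Reshape
feat = cross_attention(feat, zeta, wq, wk, wv, n_heads)  # CrossAttention
out = feat.reshape(h, w, inner) @ proj_out               # Reshape, Conv1x1
print(out.shape)  # (8, 8, 32)
```

Because only the keys and values depend on the conditioning, the length M of the conditioning sequence is independent of the spatial size h·w of the feature map.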


| | Text-to-Image | Layout-to-Image |
|---|---|---|
| seq-length | 77 | 92 |
| depth $N$ | 32 | 16 |
| dim | 1280 | 512 |

Table 17. Hyperparameters for the experiments with transformer encoders in Sec. 4.3.


E.2.2 Inpainting 


For our experiments on image inpainting in Sec. 4.5, we used the code of [88] to generate synthetic masks. We use a fixed
set of 2k validation and 30k testing samples from Places [108]. During training, we use random crops of size 256 × 256
and evaluate on crops of size 512 × 512. This follows the training and testing protocol in [88] and reproduces their reported
metrics (see † in Tab. 7). We include additional qualitative results of LDM-4, w/ attn in Fig. 21 and of LDM-4, w/o attn, big,
w/ ft in Fig. 22.
E.3. Evaluation Details 


This section provides additional details on evaluation for the experiments shown in Sec. 4. 


E.3.1 Quantitative Results in Unconditional and Class-Conditional Image Synthesis 


We follow common practice and estimate the statistics for calculating the FID-, Precision- and Recall-scores [29,50] shown in
Tab. 1 and 10 based on 50k samples from our models and the entire training set of each of the shown datasets. For calculating
FID scores we use the torch-fidelity package [60]. However, since different data processing pipelines might lead to
different results [64], we also evaluate our models with the script provided by Dhariwal and Nichol [15]. We find that results
mainly coincide, except for the ImageNet and LSUN-Bedrooms datasets, where we notice slightly varying scores of 7.76
(torch-fidelity) vs. 7.77 (Dhariwal and Nichol) and 2.95 vs. 3.0. For the future we emphasize the importance of a
unified procedure for sample quality assessment. Precision and Recall are also computed by using the script provided by
Dhariwal and Nichol.

³ https://huggingface.co/transformers/model_doc/bert.html#berttokenizerfast
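
For reference, the final step of an FID computation is the Fréchet distance between two Gaussians fitted to feature statistics. The sketch below shows only that step in NumPy and is not the code of torch-fidelity or of the script from [15]; in practice the features come from an Inception network, and differences in preprocessing before that stage are one plausible source of the small discrepancies noted above. The function names are illustrative.

```python
import numpy as np


def feature_statistics(features):
    """Mean and covariance of an (n_samples, n_features) array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)) for Gaussian statistics."""
    diff = mu1 - mu2
    # For positive semi-definite inputs, the eigenvalues of S1 @ S2 are real and
    # non-negative, so the trace of the matrix square root is the sum of their roots.
    eig = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


# Toy check with identity covariances: the distance reduces to ||mu1 - mu2||^2.
eye = np.eye(3)
print(frechet_distance(np.zeros(3), eye, np.array([1.0, 2.0, 2.0]), eye))
```

With 50k samples, the statistics are estimated once for the generated set and once for the reference set before calling the distance function.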


E.3.2 Text-to-Image Synthesis 


Following the evaluation protocol of [66] we compute FID and Inception Score for the Text-to-Image models from Tab. 2 by 
comparing generated samples with 30000 samples from the validation set of the MS-COCO dataset [51]. FID and Inception 
Scores are computed with torch-fidelity. 


E.3.3 Layout-to-Image Synthesis


For assessing the sample quality of our Layout-to-Image models from Tab. 9 on the COCO dataset, we follow common 
practice [37,87,89] and compute FID scores on the 2048 unaugmented examples of the COCO Segmentation Challenge split.
To obtain better comparability, we use the exact same samples as in [37]. For the OpenImages dataset we similarly follow 
their protocol and use 2048 center-cropped test images from the validation set. 


E.3.4 Super Resolution 


We evaluate the super-resolution models on ImageNet following the pipeline suggested in [72], i.e. images with a shorter
side of less than 256 px are removed (both for training and evaluation). On ImageNet, the low-resolution images are produced
using bicubic interpolation with anti-aliasing. FIDs are evaluated using torch-fidelity [60], and we produce samples 
on the validation split. For FID scores, we additionally compare to reference features computed on the train split, see Tab. 5 
and Tab. 11. 


E.3.5 Efficiency Analysis 


For efficiency reasons we compute the sample quality metrics plotted in Fig. 6, 17 and 7 based on 5k samples. Therefore, 
the results might vary from those shown in Tab. 1 and 10. All models have a comparable number of parameters as provided 
in Tab. 13 and 14. We maximize the learning rates of the individual models such that they still train stably. Therefore, the 
learning rates vary slightly between different runs, cf. Tab. 13 and 14.


E.3.6 User Study 


For the results of the user study presented in Tab. 4 we followed the protocol of [72] and used the 2-alternative forced-choice
paradigm to assess human preference scores for two distinct tasks. In Task-1 subjects were shown a low-resolution/masked
image between the corresponding ground truth high-resolution/unmasked version and a synthesized image, which was
generated by using the middle image as conditioning. For super-resolution subjects were asked: “Which of the two images is a
better high quality version of the low resolution image in the middle?”. For inpainting we asked “Which of the two images
contains more realistic inpainted regions of the image in the middle?”. In Task-2, humans were similarly shown the
low-res/masked version and asked for preference between two corresponding images generated by the two competing methods.
As in [72] humans viewed the images for 3 seconds before responding.
F. Computational Requirements


| Method | Generator Compute | Classifier Compute | Overall Compute | Inference Throughput* | Nparams | FID↓ | IS↑ | Precision↑ | Recall↑ |
|---|---|---|---|---|---|---|---|---|---|
| **LSUN Churches $256^2$** | | | | | | | | | |
| StyleGAN2 [42]† | 64 | - | 64 | - | 59M | 3.86 | - | - | - |
| LDM-8 (ours, 100 steps, 410K) | 18 | - | 18 | 6.80 | 256M | 4.02 | - | 0.64 | 0.52 |
| **LSUN Bedrooms $256^2$** | | | | | | | | | |
| ADM [15]† (1000 steps) | 232 | - | 232 | 0.03 | 552M | 1.9 | - | 0.66 | 0.51 |
| LDM-4 (ours, 200 steps, 1.9M) | 60 | - | 55 | 1.07 | 274M | 2.95 | - | 0.66 | 0.48 |
| **CelebA-HQ $256^2$** | | | | | | | | | |
| LDM-4 (ours, 500 steps, 410K) | 14.4 | - | 14.4 | 0.43 | 274M | 5.11 | - | 0.72 | 0.49 |
| **FFHQ $256^2$** | | | | | | | | | |
| StyleGAN2 [42] | 32.13‡ | - | 32.13‡ | - | 59M | 3.8 | - | - | - |
| LDM-4 (ours, 200 steps, 635K) | 26 | - | 26 | 1.07 | 274M | 4.98 | - | 0.73 | 0.50 |
| **ImageNet $256^2$** | | | | | | | | | |
| VQGAN-f-4 (ours, first stage) | 29 | - | 29 | - | 55M | 0.58†† | - | - | - |
| VQGAN-f-8 (ours, first stage) | 66 | - | 66 | - | 68M | 1.14†† | - | - | - |
| BigGAN-deep [3]† | 128-256 | - | 128-256 | - | 340M | 6.95 | 203.6±2.6 | 0.87 | 0.28 |
| ADM [15] (250 steps) | 916 | - | 916 | 0.12 | 554M | 10.94 | 100.98 | 0.69 | 0.63 |
| ADM-G [15] (25 steps) | 916 | 46 | 962 | 0.7 | 608M | 5.58 | - | 0.81 | 0.49 |
| ADM-G [15] (250 steps)† | 916 | 46 | 962 | 0.07 | 608M | 4.59 | 186.7 | 0.82 | 0.52 |
| ADM-G, ADM-U [15] (250 steps)† | 329 | 30 | 349 | n/a | n/a | 3.85 | 221.72 | 0.84 | 0.53 |
| LDM-8-G (ours, 100, 2.9M) | 79 | 12 | 91 | 1.93 | 506M | 8.11 | 190.4±2.6 | 0.83 | 0.36 |
| LDM-8 (ours, 200 ddim steps 2.9M, batch size 64) | 79 | - | 79 | 1.9 | 395M | 17.41 | 72.92 | 0.65 | 0.62 |
| LDM-4 (ours, 250 ddim steps 178K, batch size 1200) | 271 | - | 271 | 0.7 | 400M | 10.56 | 103.49±1.24 | 0.71 | 0.62 |
| LDM-4-G (ours, 250 ddim steps 178K, batch size 1200, classifier-free guidance [32] scale 1.25) | 271 | - | 271 | 0.4 | 400M | 3.95 | 178.22±2.43 | 0.81 | 0.55 |
| LDM-4-G (ours, 250 ddim steps 178K, batch size 1200, classifier-free guidance [32] scale 1.5) | 271 | - | 271 | 0.4 | 400M | 3.60 | 247.67±5.59 | 0.87 | 0.48 |

Table 18. Comparing compute requirements during training and inference throughput with state-of-the-art generative models. Compute
during training in V100-days, numbers of competing methods taken from [15] unless stated differently; *: Throughput measured in
samples/sec on a single NVIDIA A100; †: Numbers taken from [15]; ‡: Assumed to be trained on 25M train examples; ††: R-FID vs. ImageNet
validation set.


In Tab. 18 we provide a more detailed analysis of the compute resources we used and compare our best performing models
on the CelebA-HQ, FFHQ, LSUN and ImageNet datasets with recent state-of-the-art models by using their provided
numbers, cf. [15]. As they report their used compute in V100 days and we train all our models on a single NVIDIA A100
GPU, we convert the A100 days to V100 days by assuming a ×2.2 speedup of A100 vs. V100 [74]⁴. To assess sample quality,
we additionally report FID scores on the reported datasets. We come close to the performance of state-of-the-art methods such as
StyleGAN2 [42] and ADM [15] while significantly reducing the required compute resources.
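
The conversion described above is a single multiplication, sketched here together with a small helper that turns the throughput column of Tab. 18 into wall-clock sampling time. The function names and the 10 A100-day example are hypothetical; only the factor 2.2 and the two throughput values are taken from the text and the table.

```python
A100_VS_V100_SPEEDUP = 2.2  # conversion factor stated in the text


def a100_to_v100_days(a100_days, speedup=A100_VS_V100_SPEEDUP):
    """Express a training budget measured in A100-days as V100-days."""
    return a100_days * speedup


def hours_to_sample(n_samples, throughput):
    """Wall-clock hours to draw n_samples at `throughput` samples/sec."""
    return n_samples / throughput / 3600.0


# Hypothetical run of 10 A100-days (not a number from Tab. 18).
print(f"{a100_to_v100_days(10.0):.1f} V100-days")

# Throughputs of two guided ImageNet models as listed in Tab. 18.
for name, rate in [("ADM-G (250 steps)", 0.07), ("LDM-4-G", 0.4)]:
    print(f"{name}: {hours_to_sample(50_000, rate):.1f} h for 50k samples")
```

The second helper illustrates why throughput matters for evaluation: drawing the 50k samples used for FID takes several times longer at 0.07 samples/sec than at 0.4 samples/sec.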


⁴ This factor corresponds to the speedup of the A100 over the V100 for a U-Net, as defined in Fig. 1 in [74].
G. Details on Autoencoder Models


We train all our autoencoder models in an adversarial manner following [23], such that a patch-based discriminator $D_\psi$
is optimized to differentiate original images from reconstructions $\mathcal{D}(\mathcal{E}(x))$. To avoid arbitrarily scaled latent spaces, we
regularize the latent $z$ to be zero centered and to have small variance by introducing a regularizing loss term $L_{reg}$.
We investigate two different regularization methods: (i) a low-weighted Kullback-Leibler term between $q_\mathcal{E}(z|x) =
\mathcal{N}(z; \mathcal{E}_\mu, \mathcal{E}_{\sigma^2})$ and a standard normal distribution $\mathcal{N}(z; 0, 1)$ as in a standard variational autoencoder [46, 69], and (ii)
regularizing the latent space with a vector quantization layer by learning a codebook of $|\mathcal{Z}|$ different exemplars [96].
To obtain high-fidelity reconstructions we only use a very small regularization for both scenarios, i.e. we either weight the
KL term by a factor $\sim 10^{-6}$ or choose a high codebook dimensionality $|\mathcal{Z}|$.
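
For case (i), the KL term between a diagonal Gaussian and the standard normal has a well-known closed form, which the sketch below evaluates and scales by the small weight mentioned above. Parameterizing the encoder output as a mean and a log-variance, and the function names, are assumptions made for illustration.

```python
import math

KL_WEIGHT = 1e-6  # order of magnitude stated in the text


def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, exp(logvar)) || N(0, 1) ), summed over latent components."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, logvar)
    )


def regularizer(mean, logvar, weight=KL_WEIGHT):
    """Low-weighted KL penalty added to the autoencoder objective."""
    return weight * kl_to_standard_normal(mean, logvar)


# A latent that already matches N(0, 1) incurs no penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting one component's mean to 1 costs 0.5 before weighting.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

With such a small weight, the penalty mainly keeps the latent from drifting to an arbitrary scale while leaving the reconstruction terms dominant.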

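The full objective to train the autoencoding model $(\mathcal{E}, \mathcal{D})$ reads:

$$L_{\text{Autoencoder}} = \min_{\mathcal{E}, \mathcal{D}} \max_{\psi} \Big( L_{rec}(x, \mathcal{D}(\mathcal{E}(x))) - L_{adv}(\mathcal{D}(\mathcal{E}(x))) + \log D_\psi(x) + L_{reg}(x; \mathcal{E}, \mathcal{D}) \Big) \tag{25}$$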


DM Training in Latent Space. Note that for training diffusion models on the learned latent space, we again distinguish two
cases when learning $p(z)$ or $p(z|y)$ (Sec. 4.3): (i) For a KL-regularized latent space, we sample
$z = \mathcal{E}_\mu(x) + \mathcal{E}_\sigma(x) \cdot \varepsilon =: \mathcal{E}(x)$, where $\varepsilon \sim \mathcal{N}(0, 1)$. When rescaling the latent, we estimate the component-wise variance

$$\hat{\sigma}^2 = \frac{1}{bchw} \sum_{b,c,h,w} \left(z^{b,c,h,w} - \hat{\mu}\right)^2$$

from the first batch in the data, where $\hat{\mu} = \frac{1}{bchw} \sum_{b,c,h,w} z^{b,c,h,w}$. The output of $\mathcal{E}$ is scaled such that the rescaled latent has
unit standard deviation, i.e. $z \leftarrow \frac{z}{\hat{\sigma}} = \frac{\mathcal{E}(x)}{\hat{\sigma}}$. (ii) For a VQ-regularized latent space, we extract $z$ before the quantization layer
and absorb the quantization operation into the decoder, i.e. it can be interpreted as the first layer of $\mathcal{D}$.
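
The rescaling in case (i) can be sketched in a few lines of NumPy. The random arrays stand in for encoder outputs, and the batch shape and the function names are hypothetical; only the sampling rule and the variance estimate over all b, c, h, w entries of the first batch follow the description above.

```python
import numpy as np


def sample_latent(mean, std, rng):
    """Reparameterized sample z = E_mu(x) + E_sigma(x) * eps, eps ~ N(0, 1)."""
    return mean + std * rng.standard_normal(mean.shape)


def estimate_sigma(z_first_batch):
    """Standard deviation over all (b, c, h, w) entries of the first batch."""
    mu_hat = z_first_batch.mean()
    return np.sqrt(((z_first_batch - mu_hat) ** 2).mean())


rng = np.random.default_rng(0)
# Stand-ins for the encoder outputs on a first batch of shape (b, c, h, w).
mean = 3.0 * rng.standard_normal((8, 4, 32, 32))
std = np.full_like(mean, 0.5)

z = sample_latent(mean, std, rng)
sigma_hat = estimate_sigma(z)   # estimated once, then kept fixed
z_rescaled = z / sigma_hat      # z <- z / sigma_hat

print(float(z.std()), float(z_rescaled.std()))
```

The estimate is computed once and then reused as a constant factor for all later batches, so the diffusion model always sees latents of roughly unit scale.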


H. Additional Qualitative Results 


Finally, we provide additional qualitative results for our landscapes model (Fig. 12, 23, 24 and 25), our class-conditional
ImageNet model (Fig. 26 - 27) and our unconditional models for the CelebA-HQ, FFHQ and LSUN datasets (Fig. 28 - 31).
Similar to the inpainting model in Sec. 4.5, we also fine-tuned the semantic landscapes model from Sec. 4.3.2 directly on
$512^2$ images and depict qualitative results in Fig. 12 and Fig. 23. For those models trained on comparably small datasets,
we additionally show nearest neighbors in VGG [79] feature space for samples from our models in Fig. 32 - 34.
[Figure 19 panels: bicubic, LDM-BSR]

Figure 19. LDM-BSR generalizes to arbitrary inputs and can be used as a general-purpose upsampler, upscaling samples from the
LSUN-Cows dataset to $1024^2$ resolution.
[Figure 20 panels: input, GT, Pixel Baseline #1, Pixel Baseline #2, LDM #1, LDM #2]

Figure 20. Qualitative super-resolution comparison of two random samples between LDM-SR and a baseline diffusion model in pixel space.
Evaluated on the ImageNet validation set after the same number of training steps.
[Figure 21 panels: input, GT, LaMa [88], LDM #1, LDM #2, LDM #3]

Figure 21. Qualitative results on image inpainting. In contrast to [88], our generative approach enables generation of multiple diverse
samples for a given input.
Figure 22. More qualitative results on object removal as in Fig.
Semantic Synthesis on Flickr-Landscapes [23] ($512^2$ finetuning)


Figure 23. Convolutional samples from the semantic landscapes model as in Sec. 4.3.2, finetuned on $512^2$ images.
Figure 24. An LDM trained on $256^2$ resolution can generalize to larger resolution for spatially conditioned tasks such as semantic synthesis
of landscape images. See Sec.
Semantic Synthesis on Flickr-Landscapes [23]


Figure 25. When provided a semantic map as conditioning, our LDMs generalize to substantially larger resolutions than those seen during
training. Although this model was trained on inputs of size $256^2$ it can be used to create high-resolution samples as the ones shown here,
which are of resolution 1024 × 384.


Random class conditional samples on the ImageNet dataset 


Figure 26. Random samples from LDM-4 trained on the ImageNet dataset. Sampled with classifier-free guidance [32] scale s = 5.0 and 
200 DDIM steps with $\eta = 1.0$.
Random class conditional samples on the ImageNet dataset


Figure 27. Random samples from LDM-4 trained on the ImageNet dataset. Sampled with classifier-free guidance [32] scale s = 3.0 and 
200 DDIM steps with $\eta = 1.0$.
Random samples on the CelebA-HQ dataset


Figure 28. Random samples of our best performing model LDM-4 on the CelebA-HQ dataset. Sampled with 500 DDIM steps and $\eta = 0$
(FID = 5.15).
Random samples on the FFHQ dataset


Figure 29. Random samples of our best performing model LDM-4 on the FFHQ dataset. Sampled with 200 DDIM steps and $\eta = 1$
(FID = 4.98).
Random samples on the LSUN-Churches dataset


Figure 30. Random samples of our best performing model LDM-8 on the LSUN-Churches dataset. Sampled with 200 DDIM steps and
$\eta = 0$ (FID = 4.48).
Random samples on the LSUN-Bedrooms dataset
Figure 31. Random samples of our best performing model LDM-4 on the LSUN-Bedrooms dataset. Sampled with 200 DDIM steps and
$\eta = 1$ (FID = 2.95).
Nearest Neighbors on the CelebA-HQ dataset


Figure 32. Nearest neighbors of our best CelebA-HQ model, computed in the feature space of a VGG-16 [79]. The leftmost sample is 
from our model. The remaining samples in each row are its 10 nearest neighbors. 
Nearest Neighbors on the FFHQ dataset


Figure 33. Nearest neighbors of our best FFHQ model, computed in the feature space of a VGG-16 [79]. The leftmost sample is from our
model. The remaining samples in each row are its 10 nearest neighbors.
Nearest Neighbors on the LSUN-Churches dataset


Figure 34. Nearest neighbors of our best LSUN-Churches model, computed in the feature space of a VGG-16 [79]. The leftmost sample 
is from our model. The remaining samples in each row are its 10 nearest neighbors. 